Analyze life expectancy

The life expectancy data is from kaggle, https://www.kaggle.com/kumarajarshi/life-expectancy-who. It contains 193 countries data. There are 22 columns in the dataset. All predicting variables was then divided into several broad categories:Immunization related factors, Mortality factors, Economical factors and Social factors.Through the analysis, we want to answer the following questions:

  • The life expectancy between developed countries and developing countries are siginificant different?
  • Which factors affect the life expectancy?
  • Will Immunization factor play a role in the average age of life expectancy?
  • How adult mortality and infant mortality affect life expectancy?
  • How economic factors affect life expectancy?
  • Do social factors have same effect on
In [1]:
## need to do: some null value can be replaced with mean, some population can be obtained from internet,do not need
## to delete all with null value, need to do it later if have time.
## normalized data
## visualize data: country etc.
## use statistic model 
## check random forest
In [2]:
from google.cloud import storage
import pandas as pd
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.preprocessing import StandardScaler
In [3]:
lifeExpec = pd.read_csv('gs://life2/LifeExpectancyData.csv', sep=",")
In [4]:
lifeExpec.head()
Out[4]:
Country Year Status Life expectancy Adult Mortality infant deaths Alcohol percentage expenditure Hepatitis B Measles ... Polio Total expenditure Diphtheria HIV/AIDS GDP Population thinness 1-19 years thinness 5-9 years Income composition of resources Schooling
0 Afghanistan 2015 Developing 65.0 263.0 62 0.01 71.279624 65.0 1154 ... 6.0 8.16 65.0 0.1 574.184114 33736494.0 17.2 17.3 0.479 10.1
1 Afghanistan 2014 Developing 59.9 271.0 64 0.01 73.523582 62.0 492 ... 58.0 8.18 62.0 0.1 583.656193 32758020.0 17.5 17.5 0.476 10.0
2 Afghanistan 2013 Developing 59.9 268.0 66 0.01 73.219243 64.0 430 ... 62.0 8.13 64.0 0.1 587.565090 31731688.0 17.7 17.7 0.470 9.9
3 Afghanistan 2012 Developing 59.5 272.0 69 0.01 78.184215 67.0 2787 ... 67.0 8.52 67.0 0.1 576.190126 30696958.0 17.9 18.0 0.463 9.8
4 Afghanistan 2011 Developing 59.2 275.0 71 0.01 7.097109 68.0 3013 ... 68.0 7.87 68.0 0.1 528.736648 29708599.0 18.2 18.2 0.454 9.5

5 rows × 22 columns

In [5]:
lifeExpec.shape
Out[5]:
(2938, 22)
In [6]:
lifeExpec.rename(columns={" BMI ":"BMI","Life expectancy ":"Life_Expectancy","Adult Mortality":"Adult_Mortality",
                   "infant deaths":"Infant_Deaths","percentage expenditure":"Percentage_Exp","Hepatitis B":"HepatitisB",
                  "Measles ":"Measles"," BMI ":"BMI","under-five deaths ":"Under_Five_Deaths","Diphtheria ":"Diphtheria",
                  " HIV/AIDS":"HIV/AIDS"," thinness  1-19 years":"thinness_10to19_years"," thinness 5-9 years":"thinness_5to9_years","Income composition of resources":"Income_Comp_Of_Resources",
                   "Total expenditure":"Tot_Exp"},inplace=True)
In [7]:
lifeExpec.columns
Out[7]:
Index(['Country', 'Year', 'Status', 'Life_Expectancy', 'Adult_Mortality',
       'Infant_Deaths', 'Alcohol', 'Percentage_Exp', 'HepatitisB', 'Measles',
       'BMI', 'Under_Five_Deaths', 'Polio', 'Tot_Exp', 'Diphtheria',
       'HIV/AIDS', 'GDP', 'Population', 'thinness_10to19_years',
       'thinness_5to9_years', 'Income_Comp_Of_Resources', 'Schooling'],
      dtype='object')
In [8]:
lifeExpec.isnull().sum()
Out[8]:
Country                       0
Year                          0
Status                        0
Life_Expectancy              10
Adult_Mortality              10
Infant_Deaths                 0
Alcohol                     194
Percentage_Exp                0
HepatitisB                  553
Measles                       0
BMI                          34
Under_Five_Deaths             0
Polio                        19
Tot_Exp                     226
Diphtheria                   19
HIV/AIDS                      0
GDP                          96
Population                    6
thinness_10to19_years        34
thinness_5to9_years          34
Income_Comp_Of_Resources    167
Schooling                   163
dtype: int64
In [9]:
country_list = lifeExpec.Country.unique()
fill_list = ['Life_Expectancy','Adult_Mortality','Alcohol','HepatitisB','BMI','Polio','Tot_Exp','Diphtheria','GDP','Population','thinness_10to19_years','thinness_5to9_years','Income_Comp_Of_Resources','Schooling']
In [10]:
for country in country_list:
    lifeExpec.loc[lifeExpec['Country'] == country,fill_list] = lifeExpec.loc[lifeExpec['Country'] == country,fill_list].interpolate()
In [11]:
# Drop remaining null values after interpolation.
lifeExpec.dropna(inplace=True)
lifeExpec.shape
Out[11]:
(2410, 22)
In [12]:
countryNames = lifeExpec.Country.unique()
countryNames.size
Out[12]:
161
In [13]:
life = lifeExpec.drop('Year', axis = 1)
In [14]:
decribe = life.describe()
In [15]:
decribe.round(2)
Out[15]:
Life_Expectancy Adult_Mortality Infant_Deaths Alcohol Percentage_Exp HepatitisB Measles BMI Under_Five_Deaths Polio Tot_Exp Diphtheria HIV/AIDS GDP Population thinness_10to19_years thinness_5to9_years Income_Comp_Of_Resources Schooling
count 2410.00 2410.00 2410.00 2410.00 2410.00 2410.00 2410.00 2410.00 2410.00 2410.00 2410.00 2410.00 2410.00 2410.00 2.410000e+03 2410.00 2410.00 2410.00 2410.00
mean 68.91 165.54 31.46 4.34 673.95 75.98 2341.07 37.73 43.58 82.09 5.77 82.15 1.72 10202.08 3.636359e+07 5.06 5.11 0.62 11.85
std 9.20 123.93 126.64 3.95 1694.80 28.44 11050.63 19.85 172.43 23.56 2.31 23.59 4.71 15385.91 1.411312e+08 4.57 4.66 0.21 3.16
min 36.30 1.00 0.00 0.01 0.00 2.00 0.00 1.00 0.00 3.00 0.37 2.00 0.10 194.87 8.113100e+04 0.10 0.10 0.00 0.00
25% 63.30 76.00 0.00 0.70 17.02 66.00 0.00 18.70 0.00 77.00 4.24 78.00 0.10 1282.14 2.059850e+06 1.70 1.70 0.49 10.00
50% 71.80 146.00 3.00 3.46 91.49 89.00 15.00 42.90 4.00 93.00 5.63 92.00 0.10 3890.07 7.489080e+06 3.50 3.40 0.67 12.20
75% 75.00 225.75 21.00 7.22 472.92 96.00 356.25 55.60 26.00 97.00 7.20 97.00 0.80 10479.23 2.317568e+07 7.40 7.40 0.76 13.90
max 89.00 723.00 1800.00 17.87 18961.35 99.00 212183.00 77.10 2500.00 99.00 14.39 99.00 43.50 111968.35 1.364270e+09 27.70 28.60 0.94 20.70
In [16]:
round(lifeExpec[['Status','Life_Expectancy']].groupby(['Status']).mean(),2)
Out[16]:
Life_Expectancy
Status
Developed 78.87
Developing 67.33
In [17]:
## life Expectancy vs status 
l =(round(lifeExpec.groupby('Status')['Life_Expectancy'].mean(), 2).to_numpy())
plt.figure(figsize=(6,6))
plt.bar(lifeExpec.groupby('Status')['Status'].count().index,lifeExpec.groupby('Status')['Life_Expectancy'].mean())
plt.xlabel("Status",fontsize=12,fontweight='bold')
plt.ylabel("Avg Life Expectancy",fontsize=12,fontweight='bold')
plt.ylim(0, 85)
plt.title("Life Expectancy vs Status", fontsize=15,fontweight='bold')
for i, v in enumerate(l):
    plt.text(i +0.05,v +2,str(v), color='blue', fontweight='bold' )
plt.show()
In [18]:
stats.ttest_ind(lifeExpec.loc[lifeExpec['Status']=='Developed','Life_Expectancy'],lifeExpec.loc[lifeExpec['Status']=='Developing','Life_Expectancy'])
Out[18]:
Ttest_indResult(statistic=23.449888363481808, pvalue=1.0626781954055896e-109)
In [19]:
## check outliers using boxplot
col_dict = {'Life_Expectancy':1,'Adult_Mortality':2,'Infant_Deaths':3,'Alcohol':4,'Percentage_Exp':5,'HepatitisB':6,'Measles':7,'BMI':8,'Under_Five_Deaths':9,'Polio':10,'Tot_Exp':11,'Diphtheria':12,'HIV/AIDS':13,'GDP':14,'Population':15,'thinness_10to19_years':16,
            'thinness_5to9_years':17,'Income_Comp_Of_Resources':18,'Schooling':19}
plt.figure(figsize=(20,30))

for variable,i in col_dict.items():
                     plt.subplot(5,4,i)
                     plt.boxplot(lifeExpec[variable],whis=1.5)
                     plt.title(variable)

plt.show()

The boxplot results shows there are outliers exist in each variables. Next, we will take a look at each indivual variable, to see if we need to normalize the variables.

In [20]:
## check life Expecancy
lifeExpec[lifeExpec["Life_Expectancy"] < 45]
Out[20]:
Country Year Status Life_Expectancy Adult_Mortality Infant_Deaths Alcohol Percentage_Exp HepatitisB Measles ... Polio Tot_Exp Diphtheria HIV/AIDS GDP Population thinness_10to19_years thinness_5to9_years Income_Comp_Of_Resources Schooling
1127 Haiti 2010 Developing 36.3 682.0 23 5.76 36.292918 68.0 0 ... 66.0 8.90 66.0 1.9 665.627419 9999617.0 4.0 4.0 0.470 8.6
1484 Lesotho 2005 Developing 44.5 675.0 5 2.67 57.903698 87.0 0 ... 88.0 6.30 89.0 34.8 946.045953 1949543.0 9.3 9.2 0.437 10.7
1485 Lesotho 2004 Developing 44.8 666.0 5 1.80 67.913618 6.0 31 ... 89.0 6.96 9.0 34.6 909.874430 1933728.0 9.7 9.7 0.439 10.7
1582 Malawi 2003 Developing 44.6 613.0 43 1.08 4.375316 84.0 167 ... 85.0 6.35 84.0 24.2 372.531249 12336687.0 7.6 7.5 0.362 10.3
1583 Malawi 2002 Developing 44.0 67.0 46 1.10 3.885395 64.0 92 ... 79.0 4.82 64.0 24.7 361.043546 12013711.0 7.7 7.6 0.388 10.4
1584 Malawi 2001 Developing 43.5 599.0 48 1.15 12.797606 64.0 150 ... 86.0 5.70 9.0 25.1 363.755174 11695863.0 7.9 7.7 0.387 10.1
1585 Malawi 2000 Developing 43.1 588.0 51 1.18 13.762702 64.0 304 ... 73.0 6.70 75.0 25.5 392.524585 11376172.0 8.0 7.9 0.391 10.7
2306 Sierra Leone 2006 Developing 44.3 464.0 30 3.80 38.000758 63.0 33 ... 65.0 1.68 64.0 2.2 357.219530 5848692.0 9.1 9.1 0.348 8.0
2307 Sierra Leone 2005 Developing 43.3 48.0 30 3.83 42.088929 63.0 29 ... 67.0 12.25 65.0 2.2 353.889419 5658379.0 9.3 9.3 0.341 7.8
2308 Sierra Leone 2004 Developing 42.3 496.0 30 3.99 38.524548 63.0 7 ... 69.0 11.66 65.0 2.1 351.822124 5439695.0 9.5 9.5 0.332 7.6
2309 Sierra Leone 2003 Developing 41.5 57.0 30 4.07 38.614732 63.0 586 ... 66.0 11.69 73.0 1.9 344.826417 5199549.0 9.7 9.8 0.322 7.4
2311 Sierra Leone 2001 Developing 41.0 519.0 30 4.21 33.346915 63.0 649 ... 38.0 11.83 38.0 1.5 272.991178 4739147.0 1.1 1.2 0.302 7.0
2312 Sierra Leone 2000 Developing 39.0 533.0 29 3.97 20.395683 63.0 3575 ... 46.0 13.63 44.0 1.2 302.264266 4564297.0 1.3 1.4 0.292 6.7
2920 Zambia 2001 Developing 44.6 611.0 43 2.61 46.830275 82.0 16997 ... 86.0 6.56 85.0 18.6 973.363972 10824125.0 7.4 7.4 0.424 9.8
2921 Zambia 2000 Developing 43.8 614.0 44 2.62 45.616880 82.0 30930 ... 85.0 7.16 85.0 18.7 948.736227 10531221.0 7.5 7.5 0.418 9.6
2932 Zimbabwe 2005 Developing 44.6 717.0 28 4.14 8.717409 65.0 420 ... 69.0 6.44 68.0 30.3 971.266723 12940032.0 9.0 9.0 0.406 9.3
2933 Zimbabwe 2004 Developing 44.3 723.0 27 4.36 0.000000 68.0 31 ... 67.0 7.13 65.0 33.6 1034.962988 12777511.0 9.4 9.4 0.407 9.2
2934 Zimbabwe 2003 Developing 44.5 715.0 26 4.06 0.000000 7.0 998 ... 7.0 6.52 68.0 36.7 1102.230755 12633897.0 9.8 9.9 0.418 9.5
2935 Zimbabwe 2002 Developing 44.8 73.0 25 4.43 0.000000 73.0 304 ... 73.0 6.53 71.0 39.8 1331.013034 12500525.0 1.2 1.3 0.427 10.0

19 rows × 22 columns

In [21]:
lifeExpec[lifeExpec["Country"] == 'Sierra Leone']['Life_Expectancy']
Out[21]:
2298    48.1
2299    54.0
2300    49.7
2301    48.9
2302    48.1
2303    47.1
2304    46.2
2305    45.3
2306    44.3
2307    43.3
2308    42.3
2309    41.5
2310    48.0
2311    41.0
2312    39.0
Name: Life_Expectancy, dtype: float64
In [22]:
lifeExpec[lifeExpec["Country"] == 'Malawi']['Life_Expectancy']
Out[22]:
1571    57.6
1572    56.7
1573    55.3
1574    54.1
1575    52.9
1576    51.5
1577    50.0
1578    48.5
1579    47.1
1580    46.0
1581    45.1
1582    44.6
1583    44.0
1584    43.5
1585    43.1
Name: Life_Expectancy, dtype: float64
In [23]:
lifeExpec[lifeExpec["Country"] == 'Lesotho']['Life_Expectancy']
Out[23]:
1475    52.1
1476    52.1
1477    52.2
1478    52.3
1479    51.1
1480    49.4
1481    47.8
1482    46.2
1483    45.3
1484    44.5
1485    44.8
1486    45.5
1487    46.4
1488    47.8
1489    49.3
Name: Life_Expectancy, dtype: float64
In [24]:
lifeExpec[lifeExpec["Country"] == 'Zambia']['Life_Expectancy']
Out[24]:
2907    61.1
2908    63.0
2909    59.2
2910    58.2
2911    58.0
2912    57.4
2913    55.7
2914    52.6
2915    58.0
2916    49.3
2917    47.9
2918    46.4
2919    45.5
2920    44.6
2921    43.8
Name: Life_Expectancy, dtype: float64
In [25]:
lifeExpec[lifeExpec["Country"] == 'Zimbabwe']['Life_Expectancy']
Out[25]:
2923    59.2
2924    58.0
2925    56.6
2926    54.9
2927    52.4
2928    50.0
2929    48.2
2930    46.6
2931    45.4
2932    44.6
2933    44.3
2934    44.5
2935    44.8
2936    45.3
2937    46.0
Name: Life_Expectancy, dtype: float64

Based on the analysis above, the life Expectancy low is mainly because these data are from developing country. Here we will keep the data.

In [26]:
## check  Adult_Mortality
lifeExpec[lifeExpec["Adult_Mortality"] > 500]
Out[26]:
Country Year Status Life_Expectancy Adult_Mortality Infant_Deaths Alcohol Percentage_Exp HepatitisB Measles ... Polio Tot_Exp Diphtheria HIV/AIDS GDP Population thinness_10to19_years thinness_5to9_years Income_Comp_Of_Resources Schooling
346 Botswana 2005 Developing 51.7 566.0 2 6.37 629.842564 92.0 5 ... 96.0 5.62 96.0 20.6 5686.780532 1855852.0 1.0 9.9 0.593 11.9
347 Botswana 2004 Developing 48.1 652.0 2 4.90 469.582390 91.0 1 ... 96.0 5.56 96.0 28.4 5542.305427 1829330.0 1.5 1.4 0.580 11.8
348 Botswana 2003 Developing 46.4 693.0 2 5.51 299.367125 9.0 59 ... 96.0 4.65 96.0 31.9 5493.144116 1804339.0 1.9 1.8 0.567 11.8
349 Botswana 2002 Developing 46.0 699.0 2 6.41 6.330007 88.0 7 ... 97.0 6.47 97.0 34.6 5341.920093 1779953.0 11.4 11.3 0.558 11.9
350 Botswana 2001 Developing 46.7 679.0 2 5.48 306.952735 87.0 1 ... 97.0 5.73 97.0 37.2 5126.354186 1754935.0 11.8 11.8 0.560 11.8
351 Botswana 2000 Developing 47.8 647.0 2 5.37 250.891648 86.0 2672 ... 97.0 4.64 97.0 38.8 5211.073705 1728340.0 12.3 12.2 0.559 11.7
522 Central African Republic 2005 Developing 45.9 511.0 17 1.50 40.922080 42.0 471 ... 47.0 4.29 54.0 11.2 417.095000 4127910.0 9.7 9.7 0.319 5.9
523 Central African Republic 2004 Developing 45.7 512.0 17 1.50 41.501117 42.0 1233 ... 45.0 4.10 51.0 12.0 421.535738 4055036.0 9.9 9.9 0.315 5.7
865 Eritrea 2000 Developing 45.3 593.0 7 0.83 0.735940 86.0 789 ... 82.0 4.43 81.0 1.9 845.979095 3392801.0 1.2 1.1 0.000 3.9
1127 Haiti 2010 Developing 36.3 682.0 23 5.76 36.292918 68.0 0 ... 66.0 8.90 66.0 1.9 665.627419 9999617.0 4.0 4.0 0.470 8.6
1475 Lesotho 2014 Developing 52.1 522.0 4 0.01 162.127812 93.0 0 ... 9.0 1.62 93.0 9.4 1389.875532 2145785.0 5.8 5.6 0.491 10.8
1476 Lesotho 2013 Developing 52.1 518.0 4 0.01 153.344315 93.0 516 ... 9.0 11.70 93.0 9.6 1361.732905 2117361.0 6.1 5.9 0.484 11.1
1477 Lesotho 2012 Developing 52.2 513.0 4 0.01 168.134899 95.0 179 ... 93.0 11.14 95.0 9.0 1341.947180 2089928.0 6.4 6.2 0.479 11.0
1479 Lesotho 2010 Developing 51.1 527.0 4 2.71 154.870600 93.0 2488 ... 92.0 1.87 93.0 13.4 1199.951766 2040551.0 7.2 7.0 0.464 10.9
1480 Lesotho 2009 Developing 49.4 566.0 4 2.75 104.314473 91.0 0 ... 89.0 9.80 91.0 18.2 1130.547707 2019209.0 7.6 7.4 0.453 10.8
1481 Lesotho 2008 Developing 47.8 592.0 5 2.75 91.854328 88.0 0 ... 86.0 8.85 88.0 27.3 1108.377775 1999930.0 8.0 7.8 0.447 10.7
1482 Lesotho 2007 Developing 46.2 633.0 4 2.69 9.184327 9.0 2 ... 87.0 8.47 88.0 30.0 1038.502990 1982287.0 8.4 8.3 0.440 10.6
1483 Lesotho 2006 Developing 45.3 654.0 5 2.61 71.155776 91.0 1 ... 88.0 7.12 89.0 34.1 989.124461 1965662.0 8.8 8.7 0.437 10.7
1484 Lesotho 2005 Developing 44.5 675.0 5 2.67 57.903698 87.0 0 ... 88.0 6.30 89.0 34.8 946.045953 1949543.0 9.3 9.2 0.437 10.7
1485 Lesotho 2004 Developing 44.8 666.0 5 1.80 67.913618 6.0 31 ... 89.0 6.96 9.0 34.6 909.874430 1933728.0 9.7 9.7 0.439 10.7
1486 Lesotho 2003 Developing 45.5 648.0 5 1.99 5.300902 17.0 1 ... 9.0 7.13 9.0 33.8 889.231755 1918097.0 1.2 1.1 0.440 10.5
1487 Lesotho 2002 Developing 46.4 622.0 5 2.95 3.534574 17.0 0 ... 84.0 6.91 84.0 32.5 845.642715 1902312.0 1.6 1.6 0.446 10.4
1488 Lesotho 2001 Developing 47.8 586.0 5 2.86 38.571870 17.0 217 ... 78.0 7.53 78.0 31.2 837.127863 1885955.0 11.1 11.1 0.443 10.3
1489 Lesotho 2000 Developing 49.3 543.0 5 3.10 29.866165 17.0 660 ... 82.0 6.92 83.0 29.8 809.505724 1868699.0 11.5 11.6 0.445 9.6
1577 Malawi 2008 Developing 50.0 525.0 36 1.27 74.344830 91.0 20 ... 92.0 1.70 91.0 16.9 437.895476 14271234.0 7.0 6.9 0.400 9.6
1578 Malawi 2007 Developing 48.5 559.0 37 1.18 4.269511 87.0 143 ... 88.0 9.31 87.0 19.3 418.588219 13840969.0 7.1 7.0 0.387 9.7
1579 Malawi 2006 Developing 47.1 587.0 38 1.18 6.847034 99.0 1 ... 99.0 8.99 99.0 21.1 392.759999 13429262.0 7.3 7.1 0.377 9.6
1581 Malawi 2004 Developing 45.1 615.0 40 1.11 58.135833 89.0 1116 ... 94.0 7.82 89.0 23.4 383.094189 12676038.0 7.5 7.4 0.366 10.0
1582 Malawi 2003 Developing 44.6 613.0 43 1.08 4.375316 84.0 167 ... 85.0 6.35 84.0 24.2 372.531249 12336687.0 7.6 7.5 0.362 10.3
1584 Malawi 2001 Developing 43.5 599.0 48 1.15 12.797606 64.0 150 ... 86.0 5.70 9.0 25.1 363.755174 11695863.0 7.9 7.7 0.387 10.1
1585 Malawi 2000 Developing 43.1 588.0 51 1.18 13.762702 64.0 304 ... 73.0 6.70 75.0 25.5 392.524585 11376172.0 8.0 7.9 0.391 10.7
2310 Sierra Leone 2002 Developing 48.0 513.0 30 4.06 36.591149 63.0 568 ... 54.0 11.96 53.0 1.7 330.395927 4957216.0 9.9 1.0 0.306 7.2
2311 Sierra Leone 2001 Developing 41.0 519.0 30 4.21 33.346915 63.0 649 ... 38.0 11.83 38.0 1.5 272.991178 4739147.0 1.1 1.2 0.302 7.0
2312 Sierra Leone 2000 Developing 39.0 533.0 29 3.97 20.395683 63.0 3575 ... 46.0 13.63 44.0 1.2 302.264266 4564297.0 1.3 1.4 0.292 6.7
2727 Uganda 2002 Developing 48.8 523.0 112 10.42 2.690898 29.0 49871 ... 57.0 7.78 57.0 10.0 450.295996 25718048.0 6.8 6.8 0.404 11.0
2728 Uganda 2001 Developing 47.7 539.0 115 10.57 26.976252 29.0 48543 ... 56.0 7.26 55.0 10.8 427.346774 24854892.0 6.9 6.9 0.396 10.8
2729 Uganda 2000 Developing 46.6 554.0 116 10.47 22.594475 29.0 42554 ... 55.0 6.77 52.0 11.6 418.978046 24039274.0 7.0 7.0 0.382 9.8
2915 Zambia 2006 Developing 58.0 526.0 33 2.25 1.860004 81.0 459 ... 83.0 6.11 81.0 15.9 1183.363859 12383446.0 7.0 6.9 0.479 10.9
2916 Zambia 2005 Developing 49.3 554.0 34 2.33 121.879331 82.0 45 ... 84.0 7.56 82.0 17.0 1126.031936 12052156.0 7.1 7.0 0.467 10.7
2917 Zambia 2004 Developing 47.9 578.0 36 2.46 8.369852 82.0 35 ... 84.0 7.33 83.0 17.6 1077.836387 11731746.0 7.2 7.1 0.456 10.5
2920 Zambia 2001 Developing 44.6 611.0 43 2.61 46.830275 82.0 16997 ... 86.0 6.56 85.0 18.6 973.363972 10824125.0 7.4 7.4 0.424 9.8
2921 Zambia 2000 Developing 43.8 614.0 44 2.62 45.616880 82.0 30930 ... 85.0 7.16 85.0 18.7 948.736227 10531221.0 7.5 7.5 0.418 9.6
2927 Zimbabwe 2010 Developing 52.4 527.0 29 5.21 53.308581 9.0 9696 ... 89.0 5.37 89.0 15.7 948.331854 14086317.0 7.1 7.0 0.436 10.0
2928 Zimbabwe 2009 Developing 50.0 587.0 30 4.64 1.040021 73.0 853 ... 69.0 6.26 73.0 18.1 803.222029 13810599.0 7.5 7.4 0.419 9.9
2929 Zimbabwe 2008 Developing 48.2 632.0 30 3.56 20.843429 75.0 0 ... 75.0 4.96 75.0 20.5 725.575974 13558469.0 7.8 7.8 0.421 9.7
2932 Zimbabwe 2005 Developing 44.6 717.0 28 4.14 8.717409 65.0 420 ... 69.0 6.44 68.0 30.3 971.266723 12940032.0 9.0 9.0 0.406 9.3
2933 Zimbabwe 2004 Developing 44.3 723.0 27 4.36 0.000000 68.0 31 ... 67.0 7.13 65.0 33.6 1034.962988 12777511.0 9.4 9.4 0.407 9.2
2934 Zimbabwe 2003 Developing 44.5 715.0 26 4.06 0.000000 7.0 998 ... 7.0 6.52 68.0 36.7 1102.230755 12633897.0 9.8 9.9 0.418 9.5
2936 Zimbabwe 2001 Developing 45.3 686.0 25 1.72 0.000000 76.0 529 ... 76.0 6.16 75.0 42.1 1464.672049 12366165.0 1.6 1.7 0.427 9.8
2937 Zimbabwe 2000 Developing 46.0 665.0 24 1.68 0.000000 79.0 1483 ... 78.0 7.10 78.0 43.5 1449.042767 12222251.0 11.0 11.2 0.434 9.8

50 rows × 22 columns

The boxplot shows the data the mortality in some countries, such as Zimbabwe, Zambia, Lesotho, Botswana,Malawi are all high. That may be caused by some factors. We will do specific analysis on these countries to see if we can dig out some important information. Here, we mainly investigate Central African Republic, Eritrea,Haiti, Sierra Leone to see if the mortality is abnormal.

In [27]:
lifeExpec[lifeExpec["Country"] == 'Central African Republic']
Out[27]:
Country Year Status Life_Expectancy Adult_Mortality Infant_Deaths Alcohol Percentage_Exp HepatitisB Measles ... Polio Tot_Exp Diphtheria HIV/AIDS GDP Population thinness_10to19_years thinness_5to9_years Income_Comp_Of_Resources Schooling
513 Central African Republic 2014 Developing 58.0 437.0 15 0.01 53.439643 47.0 210 ... 47.0 4.20 47.0 4.5 334.114551 4515392.0 8.4 8.3 0.345 7.1
514 Central African Republic 2013 Developing 49.9 451.0 16 0.01 52.377666 23.0 596 ... 23.0 3.82 23.0 5.1 335.062283 4499653.0 8.5 8.5 0.370 7.1
515 Central African Republic 2012 Developing 53.0 439.0 16 0.01 7.344808 47.0 141 ... 47.0 3.62 47.0 5.1 528.129707 4490416.0 8.7 8.6 0.366 7.1
516 Central African Republic 2011 Developing 49.8 443.0 16 1.66 58.529475 47.0 679 ... 47.0 3.73 47.0 5.8 504.746050 4476153.0 8.8 8.8 0.361 6.8
517 Central African Republic 2010 Developing 49.2 446.0 17 1.67 43.483592 45.0 2 ... 46.0 3.90 45.0 6.6 487.945383 4448525.0 9.0 8.9 0.352 6.6
518 Central African Republic 2009 Developing 48.6 453.0 17 1.56 40.451569 42.0 11 ... 45.0 3.58 42.0 7.3 471.633075 4404230.0 9.1 9.1 0.345 6.4
519 Central African Republic 2008 Developing 47.6 477.0 17 1.52 67.341375 42.0 12 ... 46.0 4.30 45.0 8.3 440.866694 4345386.0 9.3 9.2 0.338 6.3
520 Central African Republic 2007 Developing 46.8 495.0 17 1.50 60.048848 42.0 49 ... 46.0 4.40 48.0 9.0 439.747444 4275800.0 9.4 9.4 0.330 6.2
521 Central African Republic 2006 Developing 46.3 56.0 17 1.54 46.901179 42.0 3 ... 47.0 3.99 51.0 10.0 428.538856 4201758.0 9.6 9.6 0.323 6.0
522 Central African Republic 2005 Developing 45.9 511.0 17 1.50 40.922080 42.0 471 ... 47.0 4.29 54.0 11.2 417.095000 4127910.0 9.7 9.7 0.319 5.9
523 Central African Republic 2004 Developing 45.7 512.0 17 1.50 41.501117 42.0 1233 ... 45.0 4.10 51.0 12.0 421.535738 4055036.0 9.9 9.9 0.315 5.7
524 Central African Republic 2003 Developing 45.7 51.0 17 1.49 46.116194 42.0 652 ... 44.0 4.31 47.0 12.8 405.757985 3981665.0 1.0 1.1 0.316 5.6
525 Central African Republic 2002 Developing 45.6 58.0 17 1.47 31.594159 42.0 938 ... 42.0 4.16 44.0 13.4 437.826009 3907612.0 1.2 1.2 0.315 5.4
526 Central African Republic 2001 Developing 45.6 54.0 17 1.52 33.653157 42.0 2837 ... 4.0 3.95 4.0 13.9 431.639048 3832203.0 1.4 1.4 0.314 5.3
527 Central African Republic 2000 Developing 46.0 49.0 16 1.51 30.783827 42.0 3207 ... 38.0 4.24 37.0 14.3 422.451781 3754986.0 1.5 1.5 0.312 5.2

15 rows × 22 columns

Adult Mortality in Central African Republic looks very abnormal, the Adult Mortality between 2000 and 2003 are very low, arount 50, however, after 2004, it increased dramatically. Based on data from world bank, https://data.worldbank.org/indicator/SP.DYN.AMRT.MA?locations=CF, the Adult Mortality is always above 390. So the Adult Mortality in 2000, 2001, 2002, 2003 and 2006 are not correct. Here, the Adult Mortality of those years will be modified based on world bank data.

In [28]:
## make a copy
lifeExpecCopy = lifeExpec
In [29]:
## 2000 
lifeExpec.loc[527,'Adult_Mortality'] = 540.27   
## 2001
lifeExpec.loc[526,'Adult_Mortality'] = 548.42
## 2002
lifeExpec.loc[525,'Adult_Mortality'] = 556.57
##2003
lifeExpec.loc[524,'Adult_Mortality'] = 548.02
## 2006
lifeExpec.loc[521,'Adult_Mortality'] = 522.39
In [30]:
lifeExpec[lifeExpec["Country"] == 'Eritrea']
Out[30]:
Country Year Status Life_Expectancy Adult_Mortality Infant_Deaths Alcohol Percentage_Exp HepatitisB Measles ... Polio Tot_Exp Diphtheria HIV/AIDS GDP Population thinness_10to19_years thinness_5to9_years Income_Comp_Of_Resources Schooling
854 Eritrea 2011 Developing 62.9 286.0 6 0.62 20.979919 96.0 48 ... 96.0 3.60 96.0 0.7 715.872542 4474690.0 8.8 8.7 0.405 5.0
855 Eritrea 2010 Developing 62.1 298.0 6 0.61 17.357398 9.0 51 ... 9.0 3.24 9.0 0.9 667.744178 4390840.0 8.9 8.8 0.404 5.1
856 Eritrea 2009 Developing 61.4 311.0 6 0.63 1.575160 92.0 82 ... 92.0 3.30 92.0 1.0 663.986573 4310334.0 9.0 8.9 0.402 5.2
857 Eritrea 2008 Developing 67.0 322.0 6 0.49 11.765723 94.0 0 ... 94.0 3.69 94.0 1.1 651.133111 4232636.0 9.1 9.1 0.406 5.2
858 Eritrea 2007 Developing 62.0 329.0 6 1.23 11.423860 91.0 55 ... 91.0 3.29 91.0 1.3 737.696469 4153332.0 9.2 9.2 0.405 5.3
859 Eritrea 2006 Developing 59.7 336.0 7 0.97 10.602698 94.0 128 ... 94.0 3.30 94.0 1.4 746.841751 4066648.0 9.3 9.3 0.405 5.3
860 Eritrea 2005 Developing 59.4 34.0 7 1.07 5.064689 96.0 19 ... 96.0 2.97 96.0 1.6 778.575536 3969007.0 9.4 9.5 0.000 5.4
861 Eritrea 2004 Developing 59.1 342.0 7 0.64 10.260973 84.0 24 ... 98.0 3.14 98.0 1.8 788.855630 3858623.0 9.6 9.6 0.000 5.0
862 Eritrea 2003 Developing 58.8 343.0 7 0.56 6.913998 91.0 376 ... 95.0 3.50 93.0 1.9 813.091934 3738265.0 9.7 9.7 0.000 4.7
863 Eritrea 2002 Developing 58.5 343.0 7 0.83 0.703132 86.0 460 ... 92.0 4.20 9.0 1.9 875.643306 3614639.0 9.9 9.9 0.000 4.4
864 Eritrea 2001 Developing 58.1 345.0 7 0.61 5.593620 86.0 204 ... 89.0 3.95 86.0 2.0 888.160096 3497124.0 1.0 1.0 0.000 4.3
865 Eritrea 2000 Developing 45.3 593.0 7 0.83 0.735940 86.0 789 ... 82.0 4.43 81.0 1.9 845.979095 3392801.0 1.2 1.1 0.000 3.9

12 rows × 22 columns

The Adult_Mortality of Eritrea in 2005 is abnormal. Based on data from world bank, https://data.worldbank.org/indicator/SP.DYN.AMRT.MA, the Adult Mortality is 202.96. So we change the value of the Adult_Mortality of Eritrea in 2005 into 202.96.

In [31]:
##2005 Eritrea
lifeExpec.loc[860,'Adult_Mortality'] = 202.96
In [32]:
lifeExpec[lifeExpec["Country"] == 'Haiti']
Out[32]:
Country Year Status Life_Expectancy Adult_Mortality Infant_Deaths Alcohol Percentage_Exp HepatitisB Measles ... Polio Tot_Exp Diphtheria HIV/AIDS GDP Population thinness_10to19_years thinness_5to9_years Income_Comp_Of_Resources Schooling
1123 Haiti 2014 Developing 63.1 245.0 14 0.01 5.103249 48.0 0 ... 55.0 7.56 48.0 0.5 730.306352 10572466.0 3.9 3.9 0.487 9.1
1124 Haiti 2013 Developing 62.7 253.0 14 5.68 4.989712 68.0 0 ... 67.0 8.10 68.0 0.5 720.712871 10431776.0 3.9 3.9 0.483 9.1
1125 Haiti 2012 Developing 62.3 259.0 15 5.68 26.379425 68.0 0 ... 67.0 9.88 67.0 0.8 701.445964 10289210.0 3.9 3.9 0.477 8.9
1126 Haiti 2011 Developing 62.3 259.0 15 5.68 4.106484 68.0 0 ... 67.0 1.41 68.0 1.5 691.894267 10145054.0 4.0 4.0 0.470 8.7
1127 Haiti 2010 Developing 36.3 682.0 23 5.76 36.292918 68.0 0 ... 66.0 8.90 66.0 1.9 665.627419 9999617.0 4.0 4.0 0.470 8.6
1128 Haiti 2009 Developing 62.5 251.0 16 5.85 41.300795 68.0 0 ... 65.0 6.68 65.0 2.0 697.688911 9852870.0 4.1 4.1 0.466 8.5
1129 Haiti 2008 Developing 62.1 259.0 16 5.95 63.831957 68.0 0 ... 64.0 5.92 63.0 2.4 687.447964 9705029.0 4.2 4.2 0.462 8.4
1130 Haiti 2007 Developing 61.8 266.0 17 6.08 56.778587 68.0 0 ... 62.0 5.56 63.0 2.7 692.553621 9556889.0 4.2 4.2 0.458 8.4
1131 Haiti 2006 Developing 61.1 28.0 17 6.18 6.995556 68.0 0 ... 61.0 5.70 6.0 3.3 680.944672 9409457.0 4.3 4.3 0.455 8.3
1132 Haiti 2005 Developing 65.0 29.0 17 5.57 38.109043 68.0 0 ... 6.0 4.41 6.0 3.9 676.792900 9263404.0 4.4 4.4 0.452 8.2
1133 Haiti 2004 Developing 58.7 32.0 18 6.10 64.398533 68.0 0 ... 58.0 5.61 55.0 4.3 675.684080 9119178.0 4.5 4.5 0.450 8.1
1134 Haiti 2003 Developing 59.7 3.0 18 6.64 44.256871 68.0 0 ... 56.0 5.32 53.0 4.6 711.922303 8976552.0 4.5 4.6 0.447 8.1
1135 Haiti 2002 Developing 59.3 33.0 19 6.10 50.285582 68.0 0 ... 54.0 5.47 48.0 4.8 721.168250 8834733.0 4.6 4.7 0.444 8.0
1136 Haiti 2001 Developing 58.9 35.0 19 6.22 60.778159 68.0 159 ... 52.0 5.63 45.0 5.0 735.191499 8692567.0 4.7 4.7 0.443 7.9
1137 Haiti 2000 Developing 58.6 35.0 20 4.79 74.460330 68.0 992 ... 5.0 6.60 41.0 5.1 755.679890 8549200.0 4.8 4.8 0.439 7.8

15 rows × 22 columns

From above chart, we could see, the Adult_Mortality in Haiti in 2010,and from 2000 to 2006 are abnormal. Based on data from world bank, https://data.worldbank.org/indicator/SP.DYN.AMRT.MA?locations=HT, the Adult Mortality is decreasing, but the the Adult_Mortality is still above 250. So we need to modify the data based on world bank.

In [33]:
## 2000 
lifeExpec.loc[1137,'Adult_Mortality'] = 337.64  
## 2001
lifeExpec.loc[1136,'Adult_Mortality'] = 336.45
## 2002
lifeExpec.loc[1135,'Adult_Mortality'] = 335.25
##2003
lifeExpec.loc[1134,'Adult_Mortality'] = 329.85
## 2004
lifeExpec.loc[1133,'Adult_Mortality'] = 324.45
##2005
lifeExpec.loc[1132,'Adult_Mortality'] = 319.05
## 2006
lifeExpec.loc[1131,'Adult_Mortality'] = 313.65
## 2010
lifeExpec.loc[1127,'Adult_Mortality'] = 293.38
In [34]:
lifeExpec[lifeExpec["Country"] == 'Sierra Leone']
Out[34]:
Country Year Status Life_Expectancy Adult_Mortality Infant_Deaths Alcohol Percentage_Exp HepatitisB Measles ... Polio Tot_Exp Diphtheria HIV/AIDS GDP Population thinness_10to19_years thinness_5to9_years Income_Comp_Of_Resources Schooling
2298 Sierra Leone 2014 Developing 48.1 463.0 23 0.01 1.443286 83.0 1006 ... 83.0 11.90 83.0 0.6 567.834267 7079162.0 7.5 7.4 0.426 9.5
2299 Sierra Leone 2013 Developing 54.0 47.0 23 0.01 1.321464 92.0 15 ... 92.0 11.59 92.0 0.8 555.205562 6922079.0 7.7 7.6 0.413 9.3
2300 Sierra Leone 2012 Developing 49.7 411.0 25 0.01 54.560337 91.0 678 ... 91.0 11.24 91.0 0.9 470.301405 6766103.0 7.9 7.8 0.401 9.1
2301 Sierra Leone 2011 Developing 48.9 418.0 26 3.78 54.665918 89.0 1865 ... 88.0 11.98 89.0 1.3 417.603168 6611692.0 8.1 8.0 0.392 8.9
2302 Sierra Leone 2010 Developing 48.1 424.0 27 3.84 5.347718 86.0 1089 ... 84.0 1.32 86.0 1.6 401.835001 6458720.0 8.3 8.2 0.384 8.7
2303 Sierra Leone 2009 Developing 47.1 433.0 28 3.97 49.837127 84.0 31 ... 81.0 13.13 84.0 1.7 390.131035 6310260.0 8.5 8.4 0.375 8.5
2304 Sierra Leone 2008 Developing 46.2 441.0 29 3.91 5.379606 77.0 44 ... 75.0 1.29 77.0 1.9 386.653814 6165372.0 8.7 8.7 0.367 8.3
2305 Sierra Leone 2007 Developing 45.3 45.0 29 3.86 45.571089 63.0 0 ... 63.0 1.12 64.0 2.2 375.668000 6015417.0 8.9 8.9 0.357 8.2
2306 Sierra Leone 2006 Developing 44.3 464.0 30 3.80 38.000758 63.0 33 ... 65.0 1.68 64.0 2.2 357.219530 5848692.0 9.1 9.1 0.348 8.0
2307 Sierra Leone 2005 Developing 43.3 48.0 30 3.83 42.088929 63.0 29 ... 67.0 12.25 65.0 2.2 353.889419 5658379.0 9.3 9.3 0.341 7.8
2308 Sierra Leone 2004 Developing 42.3 496.0 30 3.99 38.524548 63.0 7 ... 69.0 11.66 65.0 2.1 351.822124 5439695.0 9.5 9.5 0.332 7.6
2309 Sierra Leone 2003 Developing 41.5 57.0 30 4.07 38.614732 63.0 586 ... 66.0 11.69 73.0 1.9 344.826417 5199549.0 9.7 9.8 0.322 7.4
2310 Sierra Leone 2002 Developing 48.0 513.0 30 4.06 36.591149 63.0 568 ... 54.0 11.96 53.0 1.7 330.395927 4957216.0 9.9 1.0 0.306 7.2
2311 Sierra Leone 2001 Developing 41.0 519.0 30 4.21 33.346915 63.0 649 ... 38.0 11.83 38.0 1.5 272.991178 4739147.0 1.1 1.2 0.302 7.0
2312 Sierra Leone 2000 Developing 39.0 533.0 29 3.97 20.395683 63.0 3575 ... 46.0 13.63 44.0 1.2 302.264266 4564297.0 1.3 1.4 0.292 6.7

15 rows × 22 columns

From above chart, we could see, the Adult_Mortality in Sierra Leone in 2003, 2005, 2007, 2013 are abnormal. Based on data from world bank, https://data.worldbank.org/indicator/SP.DYN.AMRT.MA, the Adult_Mortality is still above 400. So we need to modify the data based on world bank.

In [35]:
## modify the Adult_Mortality in Sierra Leone
## 2003
lifeExpec.loc[2309,'Adult_Mortality'] = 511.26
## 2005
lifeExpec.loc[2307,'Adult_Mortality'] = 483.17
## 2007
lifeExpec.loc[2305,'Adult_Mortality'] = 455.08
##2013
lifeExpec.loc[2299,'Adult_Mortality'] = 409.65
In [36]:
lifeExpec[lifeExpec["Country"] == 'Sierra Leone']
Out[36]:
Country Year Status Life_Expectancy Adult_Mortality Infant_Deaths Alcohol Percentage_Exp HepatitisB Measles ... Polio Tot_Exp Diphtheria HIV/AIDS GDP Population thinness_10to19_years thinness_5to9_years Income_Comp_Of_Resources Schooling
2298 Sierra Leone 2014 Developing 48.1 463.00 23 0.01 1.443286 83.0 1006 ... 83.0 11.90 83.0 0.6 567.834267 7079162.0 7.5 7.4 0.426 9.5
2299 Sierra Leone 2013 Developing 54.0 409.65 23 0.01 1.321464 92.0 15 ... 92.0 11.59 92.0 0.8 555.205562 6922079.0 7.7 7.6 0.413 9.3
2300 Sierra Leone 2012 Developing 49.7 411.00 25 0.01 54.560337 91.0 678 ... 91.0 11.24 91.0 0.9 470.301405 6766103.0 7.9 7.8 0.401 9.1
2301 Sierra Leone 2011 Developing 48.9 418.00 26 3.78 54.665918 89.0 1865 ... 88.0 11.98 89.0 1.3 417.603168 6611692.0 8.1 8.0 0.392 8.9
2302 Sierra Leone 2010 Developing 48.1 424.00 27 3.84 5.347718 86.0 1089 ... 84.0 1.32 86.0 1.6 401.835001 6458720.0 8.3 8.2 0.384 8.7
2303 Sierra Leone 2009 Developing 47.1 433.00 28 3.97 49.837127 84.0 31 ... 81.0 13.13 84.0 1.7 390.131035 6310260.0 8.5 8.4 0.375 8.5
2304 Sierra Leone 2008 Developing 46.2 441.00 29 3.91 5.379606 77.0 44 ... 75.0 1.29 77.0 1.9 386.653814 6165372.0 8.7 8.7 0.367 8.3
2305 Sierra Leone 2007 Developing 45.3 455.08 29 3.86 45.571089 63.0 0 ... 63.0 1.12 64.0 2.2 375.668000 6015417.0 8.9 8.9 0.357 8.2
2306 Sierra Leone 2006 Developing 44.3 464.00 30 3.80 38.000758 63.0 33 ... 65.0 1.68 64.0 2.2 357.219530 5848692.0 9.1 9.1 0.348 8.0
2307 Sierra Leone 2005 Developing 43.3 483.17 30 3.83 42.088929 63.0 29 ... 67.0 12.25 65.0 2.2 353.889419 5658379.0 9.3 9.3 0.341 7.8
2308 Sierra Leone 2004 Developing 42.3 496.00 30 3.99 38.524548 63.0 7 ... 69.0 11.66 65.0 2.1 351.822124 5439695.0 9.5 9.5 0.332 7.6
2309 Sierra Leone 2003 Developing 41.5 511.26 30 4.07 38.614732 63.0 586 ... 66.0 11.69 73.0 1.9 344.826417 5199549.0 9.7 9.8 0.322 7.4
2310 Sierra Leone 2002 Developing 48.0 513.00 30 4.06 36.591149 63.0 568 ... 54.0 11.96 53.0 1.7 330.395927 4957216.0 9.9 1.0 0.306 7.2
2311 Sierra Leone 2001 Developing 41.0 519.00 30 4.21 33.346915 63.0 649 ... 38.0 11.83 38.0 1.5 272.991178 4739147.0 1.1 1.2 0.302 7.0
2312 Sierra Leone 2000 Developing 39.0 533.00 29 3.97 20.395683 63.0 3575 ... 46.0 13.63 44.0 1.2 302.264266 4564297.0 1.3 1.4 0.292 6.7

15 rows × 22 columns

Infant death are the number of Infant Deaths per 1000 population. Based on the boxplot, we can see there is some value greater than 1000. These values must be not correct. The details shows below.

In [37]:
lifeExpec[lifeExpec["Infant_Deaths"] > 750]
Out[37]:
Country Year Status Life_Expectancy Adult_Mortality Infant_Deaths Alcohol Percentage_Exp HepatitisB Measles ... Polio Tot_Exp Diphtheria HIV/AIDS GDP Population thinness_10to19_years thinness_5to9_years Income_Comp_Of_Resources Schooling
1187 India 2014 Developing 68.0 184.0 957 3.07 86.521539 79.0 79563 ... 84.0 4.69 85.0 0.2 1640.180700 1.293859e+09 26.8 27.4 0.607 11.6
1188 India 2013 Developing 67.6 187.0 1000 3.11 67.672304 7.0 13822 ... 82.0 4.53 83.0 0.2 1544.619247 1.278562e+09 26.8 27.5 0.599 11.5
1189 India 2012 Developing 67.3 19.0 1100 3.10 64.969645 73.0 18668 ... 79.0 4.39 82.0 0.2 1469.177610 1.263066e+09 26.9 27.6 0.590 11.3
1190 India 2011 Developing 66.8 193.0 1100 3.00 64.605901 44.0 33634 ... 79.0 4.33 82.0 0.2 1410.426305 1.247236e+09 26.9 27.7 0.580 10.8
1191 India 2010 Developing 66.4 196.0 1200 2.77 57.733599 38.0 31458 ... 76.0 4.28 79.0 0.2 1357.563719 1.230981e+09 27.0 27.8 0.569 10.4
1192 India 2009 Developing 66.0 2.0 1300 2.50 0.844186 37.0 56188 ... 73.0 4.38 74.0 0.2 1268.249210 1.214270e+09 27.0 27.8 0.563 10.5
1193 India 2008 Developing 65.5 23.0 1300 1.93 43.030433 29.0 44258 ... 69.0 4.34 7.0 0.3 1192.511732 1.197147e+09 27.0 27.9 0.556 10.2
1194 India 2007 Developing 65.2 26.0 1400 1.59 5.234770 6.0 41144 ... 67.0 4.23 64.0 0.3 1173.875310 1.179681e+09 27.1 28.0 0.546 9.9
1195 India 2006 Developing 64.8 28.0 1500 1.37 34.859427 6.0 64185 ... 66.0 4.25 65.0 0.3 1106.926470 1.161978e+09 27.1 28.0 0.536 9.7
1196 India 2005 Developing 64.4 211.0 1500 1.27 3.509637 8.0 36711 ... 65.0 4.28 65.0 0.3 1040.312313 1.144119e+09 27.2 28.1 0.526 9.4
1197 India 2004 Developing 64.0 214.0 1600 1.20 27.338009 6.0 55443 ... 58.0 4.22 63.0 0.3 979.283848 1.126136e+09 27.2 28.2 0.518 9.2
1198 India 2003 Developing 63.7 216.0 1700 1.19 19.480868 6.0 47147 ... 57.0 4.30 61.0 0.3 922.167960 1.108028e+09 27.3 28.3 0.505 8.6
1199 India 2002 Developing 63.3 219.0 1700 1.10 17.812056 6.0 40044 ... 58.0 4.40 59.0 0.3 869.201387 1.089807e+09 27.4 28.4 0.499 8.4
1200 India 2001 Developing 62.9 222.0 1800 1.00 19.003406 6.0 51780 ... 58.0 4.50 59.0 0.3 851.616569 1.071478e+09 27.5 28.5 0.494 8.3
1201 India 2000 Developing 62.5 224.0 1800 0.93 19.266157 6.0 38835 ... 57.0 4.26 58.0 0.3 826.592493 1.053051e+09 27.7 28.6 0.489 8.3

15 rows × 22 columns

The above shows that the infant death in India is greater than 1000. Based on data from world bank, https://data.worldbank.org/indicator/SP.DYN.IMRT.IN?locations=IN, the Adult_Mortality is below 400. So, we need to modify the infant death rate in India.

In [38]:
## 2000
lifeExpec.loc[1201,'Infant_Deaths'] = 66.7
## 2001
lifeExpec.loc[1200,'Infant_Deaths'] = 64.4
## 2002
lifeExpec.loc[1199,'Infant_Deaths'] = 62.2
##2003
lifeExpec.loc[1198,'Infant_Deaths'] = 60
## 2004
lifeExpec.loc[1197,'Infant_Deaths'] = 57.8
## 2005
lifeExpec.loc[1196,'Infant_Deaths'] = 55.7
## 2006
lifeExpec.loc[1195,'Infant_Deaths'] = 53.7
##2007
lifeExpec.loc[1194,'Infant_Deaths'] = 51.6
## 2008
lifeExpec.loc[1193,'Infant_Deaths'] = 49.4
## 2009
lifeExpec.loc[1192,'Infant_Deaths'] = 47.3
## 2010
lifeExpec.loc[1191,'Infant_Deaths'] = 45.1
##2011
lifeExpec.loc[1190,'Infant_Deaths'] = 43
##2012
lifeExpec.loc[1189,'Infant_Deaths'] = 40.9
## 2013
lifeExpec.loc[1188,'Infant_Deaths'] = 38.8
## 2014
lifeExpec.loc[1187,'Infant_Deaths'] = 36.9
In [39]:
lifeExpec[lifeExpec["Alcohol"] > 16]
Out[39]:
Country Year Status Life_Expectancy Adult_Mortality Infant_Deaths Alcohol Percentage_Exp HepatitisB Measles ... Polio Tot_Exp Diphtheria HIV/AIDS GDP Population thinness_10to19_years thinness_5to9_years Income_Comp_Of_Resources Schooling
227 Belarus 2012 Developing 71.9 194.0 0.0 16.35 91.709621 97.0 10 ... 98.0 5.10 98.0 0.1 6642.035132 9464495.0 2.0 2.1 0.793 15.6
228 Belarus 2011 Developing 72.0 232.0 0.0 17.31 846.911307 98.0 50 ... 98.0 4.92 98.0 0.1 6525.851369 9473172.0 2.0 2.1 0.787 15.5
873 Estonia 2008 Developing 74.2 167.0 0.0 16.99 225.072362 94.0 0 ... 95.0 6.60 95.0 0.1 16752.282740 1337090.0 2.0 2.1 0.835 16.1
874 Estonia 2007 Developing 73.0 189.0 0.0 17.87 1904.124690 95.0 1 ... 95.0 5.16 95.0 0.1 17603.243270 1340680.0 2.0 2.1 0.829 16.1
875 Estonia 2006 Developing 73.0 188.0 0.0 16.58 244.351080 95.0 27 ... 95.0 5.10 95.0 0.1 16289.429080 1346810.0 2.1 2.2 0.822 16.1

5 rows × 22 columns

It seems that there is only two countries whose alcohol values is higher,so a investigatation will be done below.

In [40]:
lifeExpec[lifeExpec["Country"] == 'Belarus']["Alcohol"]
Out[40]:
225    13.94
226    14.66
227    16.35
228    17.31
229    14.44
230    14.09
231    14.67
232    14.22
233    12.60
234    11.01
235    12.05
236    11.17
237    12.23
238    10.74
239    12.98
Name: Alcohol, dtype: float64
In [41]:
lifeExpec[lifeExpec["Country"] == 'Estonia']["Alcohol"]
Out[41]:
867     0.01
868     0.01
869     0.01
870     0.01
871    14.97
872    15.04
873    16.99
874    17.87
875    16.58
876    15.52
877    15.07
878    11.64
879    11.48
880     0.01
881     0.01
Name: Alcohol, dtype: float64

The data here looks a ittle bit sketptical, but without concrete reason, no further step is taken at this moment.

The lifeExpec data looks skeptical based on the boxplot and analysis below. As discribe on the website, Expenditure on health as a percentage of Gross Domestic Product per capita(%). For now, just keep the data here, further investigation will be done.

In [42]:
lifeExpec["Percentage_Exp"].max()
Out[42]:
18961.3486
In [43]:
lifeExpec["Percentage_Exp"].min()
Out[43]:
0.0
In [44]:
lifeExpec[lifeExpec["HepatitisB"] < 5].head()
Out[44]:
Country Year Status Life_Expectancy Adult_Mortality Infant_Deaths Alcohol Percentage_Exp HepatitisB Measles ... Polio Tot_Exp Diphtheria HIV/AIDS GDP Population thinness_10to19_years thinness_5to9_years Income_Comp_Of_Resources Schooling
461 Cabo Verde 2002 Developing 77.0 148.0 0.0 3.82 155.207267 4.0 0 ... 92.0 5.17 91.0 0.8 2303.971345 452106.0 9.2 9.1 0.569 11.3
462 Cabo Verde 2001 Developing 73.0 152.0 0.0 3.81 150.743486 4.0 0 ... 91.0 5.19 9.0 0.8 2225.412110 443716.0 9.4 9.3 0.562 11.0
463 Cabo Verde 2000 Developing 69.9 155.0 0.0 3.49 122.574470 4.0 2 ... 9.0 4.81 9.0 0.8 2215.068170 435079.0 9.6 9.5 0.000 11.3
531 Chad 2012 Developing 51.8 367.0 46.0 0.62 57.824271 4.0 120 ... 51.0 3.00 4.0 3.6 908.426122 12705135.0 9.0 8.9 0.381 7.3
835 Equatorial Guinea 2014 Developing 57.9 32.0 3.0 0.01 13.404774 2.0 13 ... 24.0 3.80 2.0 4.4 16130.336370 1129424.0 8.5 8.4 0.582 9.2

5 rows × 22 columns

HepatitisB: is Hepatitis B (HepB) immunization coverage among 1-year-olds, the HepatitisB rate lower than 5% are from developed country, so we will leave the data as original.

In [45]:
lifeExpec[lifeExpec["Measles"] > 30000]
Out[45]:
Country Year Status Life_Expectancy Adult_Mortality Infant_Deaths Alcohol Percentage_Exp HepatitisB Measles ... Polio Tot_Exp Diphtheria HIV/AIDS GDP Population thinness_10to19_years thinness_5to9_years Income_Comp_Of_Resources Schooling
406 Burkina Faso 2009 Developing 56.9 283.0 44.0 4.55 81.143047 92.0 54118 ... 91.0 7.41 92.0 1.1 562.841973 1.514110e+07 9.3 8.8 0.356 5.9
561 China 2014 Developing 75.8 86.0 171.0 5.78 109.874390 99.0 52628 ... 99.0 5.55 99.0 0.1 6096.487817 1.364270e+09 3.7 3.0 0.723 13.1
565 China 2010 Developing 75.0 92.0 231.0 5.75 5.660755 99.0 38159 ... 99.0 4.89 99.0 0.1 4550.453596 1.337705e+09 4.2 3.6 0.691 12.5
566 China 2009 Developing 74.9 93.0 248.0 4.88 50.283489 99.0 52461 ... 99.0 5.80 99.0 0.1 4132.902312 1.331260e+09 4.4 3.8 0.682 12.2
567 China 2008 Developing 74.5 97.0 266.0 4.27 39.225097 95.0 131441 ... 99.0 4.59 97.0 0.1 3796.633363 1.324655e+09 4.5 4.0 0.672 11.9
568 China 2007 Developing 74.4 96.0 285.0 3.88 312.662482 92.0 109023 ... 94.0 4.32 93.0 0.1 3480.152725 1.317885e+09 4.7 4.1 0.659 11.4
569 China 2006 Developing 74.2 98.0 307.0 3.28 29.743430 91.0 99602 ... 94.0 4.52 93.0 0.1 3062.534905 1.311020e+09 4.8 4.3 0.646 11.0
570 China 2005 Developing 73.9 99.0 332.0 2.92 171.659603 84.0 124219 ... 87.0 4.66 87.0 0.1 2732.165880 1.303720e+09 5.0 4.4 0.634 10.6
571 China 2004 Developing 73.5 11.0 360.0 3.04 1.586685 79.0 70549 ... 87.0 4.72 87.0 0.1 2467.132843 1.296075e+09 5.1 4.6 0.622 10.2
572 China 2003 Developing 73.1 13.0 391.0 2.96 122.936535 75.0 71879 ... 87.0 4.82 86.0 0.1 2253.929689 1.288400e+09 5.3 4.7 0.610 9.9
573 China 2002 Developing 72.7 16.0 422.0 2.91 106.359036 7.0 58341 ... 86.0 4.79 86.0 0.1 2061.162284 1.280400e+09 5.5 4.9 0.600 9.7
574 China 2001 Developing 72.2 11.0 457.0 2.84 14.230645 65.0 88962 ... 86.0 4.56 86.0 0.1 1901.407630 1.271850e+09 5.7 5.0 0.592 9.6
575 China 2000 Developing 71.7 115.0 490.0 3.06 17.460574 6.0 71093 ... 86.0 4.60 85.0 0.1 1767.833627 1.262645e+09 5.9 5.1 0.583 9.5
1187 India 2014 Developing 68.0 184.0 36.9 3.07 86.521539 79.0 79563 ... 84.0 4.69 85.0 0.2 1640.180700 1.293859e+09 26.8 27.4 0.607 11.6
1190 India 2011 Developing 66.8 193.0 43.0 3.00 64.605901 44.0 33634 ... 79.0 4.33 82.0 0.2 1410.426305 1.247236e+09 26.9 27.7 0.580 10.8
1191 India 2010 Developing 66.4 196.0 45.1 2.77 57.733599 38.0 31458 ... 76.0 4.28 79.0 0.2 1357.563719 1.230981e+09 27.0 27.8 0.569 10.4
1192 India 2009 Developing 66.0 2.0 47.3 2.50 0.844186 37.0 56188 ... 73.0 4.38 74.0 0.2 1268.249210 1.214270e+09 27.0 27.8 0.563 10.5
1193 India 2008 Developing 65.5 23.0 49.4 1.93 43.030433 29.0 44258 ... 69.0 4.34 7.0 0.3 1192.511732 1.197147e+09 27.0 27.9 0.556 10.2
1194 India 2007 Developing 65.2 26.0 51.6 1.59 5.234770 6.0 41144 ... 67.0 4.23 64.0 0.3 1173.875310 1.179681e+09 27.1 28.0 0.546 9.9
1195 India 2006 Developing 64.8 28.0 53.7 1.37 34.859427 6.0 64185 ... 66.0 4.25 65.0 0.3 1106.926470 1.161978e+09 27.1 28.0 0.536 9.7
1196 India 2005 Developing 64.4 211.0 55.7 1.27 3.509637 8.0 36711 ... 65.0 4.28 65.0 0.3 1040.312313 1.144119e+09 27.2 28.1 0.526 9.4
1197 India 2004 Developing 64.0 214.0 57.8 1.20 27.338009 6.0 55443 ... 58.0 4.22 63.0 0.3 979.283848 1.126136e+09 27.2 28.2 0.518 9.2
1198 India 2003 Developing 63.7 216.0 60.0 1.19 19.480868 6.0 47147 ... 57.0 4.30 61.0 0.3 922.167960 1.108028e+09 27.3 28.3 0.505 8.6
1199 India 2002 Developing 63.3 219.0 62.2 1.10 17.812056 6.0 40044 ... 58.0 4.40 59.0 0.3 869.201387 1.089807e+09 27.4 28.4 0.499 8.4
1200 India 2001 Developing 62.9 222.0 64.4 1.00 19.003406 6.0 51780 ... 58.0 4.50 59.0 0.3 851.616569 1.071478e+09 27.5 28.5 0.494 8.3
1201 India 2000 Developing 62.5 224.0 66.7 0.93 19.266157 6.0 38835 ... 57.0 4.26 58.0 0.3 826.592493 1.053051e+09 27.7 28.6 0.489 8.3
1240 Iraq 2009 Developing 74.0 148.0 32.0 0.20 185.636698 75.0 30328 ... 78.0 4.65 78.0 0.1 4493.184126 2.989465e+07 5.4 5.1 0.643 10.3
1565 Madagascar 2004 Developing 64.0 267.0 38.0 0.81 23.727963 71.0 35558 ... 74.0 4.89 78.0 0.6 465.987257 1.780300e+07 8.2 8.1 0.466 8.7
1566 Madagascar 2003 Developing 59.9 268.0 40.0 0.93 37.128948 61.0 62233 ... 65.0 4.81 66.0 0.7 456.135641 1.727914e+07 8.3 8.3 0.457 8.5
1569 Madagascar 2000 Developing 57.9 283.0 44.0 1.16 35.661251 51.0 35256 ... 58.0 5.80 57.0 0.6 491.820179 1.576681e+07 8.7 8.6 0.000 8.0
1575 Malawi 2010 Developing 52.9 462.0 35.0 1.08 9.728005 93.0 118712 ... 86.0 1.50 93.0 13.7 478.668590 1.516710e+07 6.8 6.7 0.430 10.2
1888 Niger 2004 Developing 52.9 279.0 56.0 0.11 20.861184 71.0 63057 ... 45.0 6.61 43.0 1.6 324.016216 1.312701e+07 12.1 12.0 0.270 3.1
1889 Niger 2003 Developing 52.1 28.0 56.0 0.10 20.268766 71.0 54190 ... 44.0 6.23 41.0 1.6 335.923960 1.265687e+07 12.3 12.2 0.266 3.0
1890 Niger 2002 Developing 51.4 282.0 57.0 0.10 17.587227 71.0 31584 ... 43.0 6.55 39.0 1.6 331.002152 1.220600e+07 12.5 12.5 0.261 2.9
1891 Niger 2001 Developing 56.0 283.0 57.0 0.11 1.817830 71.0 61208 ... 42.0 7.10 36.0 1.6 333.358919 1.177198e+07 12.7 12.7 0.255 2.9
1895 Nigeria 2013 Developing 53.2 367.0 498.0 8.30 194.203288 46.0 52852 ... 46.0 3.70 46.0 3.9 2476.863878 1.718293e+08 1.4 1.2 0.514 9.8
1903 Nigeria 2005 Developing 49.2 4.0 556.0 9.71 6.416253 18.0 110927 ... 45.0 4.11 36.0 5.4 1857.925564 1.389395e+08 12.9 12.9 0.463 8.9
1904 Nigeria 2004 Developing 48.5 47.0 563.0 9.76 57.225558 18.0 31521 ... 43.0 4.33 33.0 5.4 1791.261545 1.353936e+08 13.2 13.2 0.445 8.5
1905 Nigeria 2003 Developing 48.1 41.0 567.0 9.75 30.195508 18.0 141258 ... 42.0 4.50 29.0 5.4 1682.099989 1.319725e+08 13.5 13.6 0.000 8.1
1906 Nigeria 2002 Developing 47.7 49.0 571.0 9.61 17.137754 18.0 42007 ... 4.0 2.43 25.0 5.3 1607.238267 1.286667e+08 13.8 13.8 0.000 7.7
1907 Nigeria 2001 Developing 47.4 48.0 574.0 9.58 15.830985 18.0 168107 ... 36.0 3.25 27.0 5.1 1429.196527 1.254634e+08 14.1 14.1 0.000 8.0
1908 Nigeria 2000 Developing 47.1 45.0 576.0 9.23 22.481776 18.0 212183 ... 31.0 2.84 29.0 4.9 1383.666051 1.223520e+08 14.3 14.4 0.000 7.6
2024 Philippines 2014 Developing 68.4 214.0 54.0 4.52 31.272322 67.0 58848 ... 77.0 4.71 67.0 0.1 2495.575295 1.001022e+08 1.0 9.7 0.676 11.7
2695 Turkey 2001 Developing 78.0 14.0 41.0 1.49 256.434175 77.0 30509 ... 88.0 5.16 88.0 0.1 7631.558706 6.419147e+07 5.2 5.1 0.653 11.1
2727 Uganda 2002 Developing 48.8 523.0 112.0 10.42 2.690898 29.0 49871 ... 57.0 7.78 57.0 10.0 450.295996 2.571805e+07 6.8 6.8 0.404 11.0
2728 Uganda 2001 Developing 47.7 539.0 115.0 10.57 26.976252 29.0 48543 ... 56.0 7.26 55.0 10.8 427.346774 2.485489e+07 6.9 6.9 0.396 10.8
2729 Uganda 2000 Developing 46.6 554.0 116.0 10.47 22.594475 29.0 42554 ... 55.0 6.77 52.0 11.6 418.978046 2.403927e+07 7.0 7.0 0.382 9.8
2739 Ukraine 2006 Developing 67.7 267.0 5.0 7.99 29.381727 96.0 42724 ... 99.0 6.39 98.0 0.8 2983.857295 4.678775e+07 2.6 2.7 0.716 14.7
2921 Zambia 2000 Developing 43.8 614.0 44.0 2.62 45.616880 82.0 30930 ... 85.0 7.16 85.0 18.7 948.736227 1.053122e+07 7.5 7.5 0.418 9.6

49 rows × 22 columns

In [46]:
lifeExpec["Measles"].max()
Out[46]:
212183

As the more Measles cases are in developing country, without further evidence that Measles is not correct, we will keep the data as original.

In the beggining, we analyze whether developed country or developing country, it make different with the Life_Expectancy. Next we will analyze other factors that could affect the life expendency. Let's see the heat map.

In [47]:
corr = lifeExpec.corr()
ax = plt.axes()
sns.heatmap(corr, 
        xticklabels=corr.columns,
        yticklabels=corr.columns)
ax.set_title("Heat Map",fontsize = 20,fontweight='bold')
Out[47]:
Text(0.5, 1.0, 'Heat Map')

From the heat map above, we could see life Expectancy has negative correlation with Adult_Mortality, Infant_Deaths, Measles, HIV/AIDS, under-five-year death, thinness 1-19 years, thinness 5-9 years. Life Expectancy has positive correlation with alcohol, percentage_Exp, BMI,Diphtheria,GDP,income_comp_Of_Resource, and Schoolng. For poio, population, the correlation is so small, we could not say it's negative or positive. It worth to mention that life expectancy has high correlation with adult_mortality, BMI, Diphtheria, HIV/AIDS,thinness_1-19_years, thiness_5_to_9_years, income_of_resource and schooling. Further analysis is needed to gain more information.

Next we will investigate factors one by one, to see which factor will affect life expectancy and how it will affect. To dive more, the factors will be divided into immunization factors, mortality factors, economic factors, social factors and other health related factors.

Mortality Factor

In [48]:
## divide country into developed and developing counties
lifeExpecDevelping = lifeExpec[lifeExpec["Status"] == "Developing"]
lifeExpecDevelped = lifeExpec[lifeExpec["Status"] == "Developed"]
lifeExpecDevelping.shape
Out[48]:
(2080, 22)
In [49]:
lifeExpec['Life_Expectancy'].corr(lifeExpec['Adult_Mortality'])
Out[49]:
-0.7028445867028081
In [50]:
ax = plt.axes()
sns.scatterplot(data = lifeExpec, y = lifeExpec['Life_Expectancy'], x = lifeExpec['Adult_Mortality'],hue = "Status")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('Adult Mortality', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs Adult Mortality",  fontsize = 20,fontweight='bold')
Out[50]:
Text(0.5, 1.0, 'Life Expectancy vs Adult Mortality')
In [51]:
lifeExpecDevelped['Life_Expectancy'].corr(lifeExpecDevelped['Adult_Mortality'])
Out[51]:
-0.4799820749990814
In [52]:
ax = plt.axes()
sns.scatterplot(data = lifeExpecDevelped, y = lifeExpecDevelped['Life_Expectancy'], x = lifeExpecDevelped['Adult_Mortality'],hue = "Country")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('Adult_Mortality', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs Adult_Mortality in Developed Countries",  fontsize = 12,fontweight='bold')

plt.legend(loc='upper left', bbox_to_anchor=(-0.2, -0.06),fancybox=True, shadow=True, ncol=5)
Out[52]:
<matplotlib.legend.Legend at 0x7f44cb26fac0>
In [53]:
lifeExpecDevelping['Life_Expectancy'].corr(lifeExpecDevelping['Adult_Mortality'])
Out[53]:
-0.6813495919717263
In [54]:
ax = plt.axes()
sns.scatterplot(data = lifeExpecDevelping, y = lifeExpecDevelping['Life_Expectancy'], x = lifeExpecDevelping['Adult_Mortality'],hue = "Country")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('Adult_Mortality', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs Adult_Mortality",  fontsize = 15,fontweight='bold')

plt.legend(loc='upper left', bbox_to_anchor=(-0.2, -0.06),fancybox=True, shadow=True, ncol=5)
Out[54]:
<matplotlib.legend.Legend at 0x7f44cb16f130>

Above, the pearson correlation is used to test the correlation between Life Expectancy and Adult Mortality. The correlation is -0.7028. That means, the higher the mortality in a country, the shorter life expectancy. This applies to both developed countries and developing countries.

In [55]:
lifeExpec['Life_Expectancy'].corr(lifeExpec['Infant_Deaths'])
Out[55]:
-0.30072840781414983
In [56]:
ax = plt.axes()
sns.scatterplot(data = lifeExpec, y = lifeExpec['Life_Expectancy'], x = lifeExpec['Infant_Deaths'],hue = "Status")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('Infant Death', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs Infant Death",  fontsize = 20,fontweight='bold')
Out[56]:
Text(0.5, 1.0, 'Life Expectancy vs Infant Death')

From above analysis, the correlation between life Expectancy and Infant Death is relatively low, -0.3007. That means, the higher life expectancy, the relatively lower infant death rate. But the correlation is low. There is an interesting phenomenia observed from correlation graph. For developed country, there is almost no relationship between life expectancy and infant death. Only for developing country the infant death may have low relationship with life expectancy. Further analysis are done below.

In [57]:
## divide country into developed and developing counties
lifeExpecDevelping = lifeExpec[lifeExpec["Status"] == "Developing"]
lifeExpecDevelped = lifeExpec[lifeExpec["Status"] == "Developed"]
lifeExpecDevelping.shape
Out[57]:
(2080, 22)
In [58]:
ax = plt.axes()
sns.scatterplot(data = lifeExpecDevelping, y = lifeExpecDevelping['Life_Expectancy'], x = lifeExpecDevelping['Infant_Deaths'],hue = "Country")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('Infant Death', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs Infant Death",  fontsize = 20,fontweight='bold')

plt.legend(loc='upper left', bbox_to_anchor=(-0.2, -0.06),fancybox=True, shadow=True, ncol=5)
Out[58]:
<matplotlib.legend.Legend at 0x7f44cb133220>
In [59]:
lifeExpecDevelping['Life_Expectancy'].corr(lifeExpecDevelping['Infant_Deaths'])
Out[59]:
-0.27211689706120556
In [60]:
lifeExpecDevelped['Life_Expectancy'].corr(lifeExpecDevelped['Infant_Deaths'])
Out[60]:
-0.07492971195058892

Indeed, some countries, the life expectancy has high correlation with infant death. For these country, lower the infant death could be one factor increase the whole country life expectancy.

In [61]:
lifeExpec['Life_Expectancy'].corr(lifeExpec['Under_Five_Deaths'])
Out[61]:
-0.20293222227457036
In [62]:
ax = plt.axes()
sns.scatterplot(data = lifeExpec, y = lifeExpec['Life_Expectancy'], x = lifeExpec['Under_Five_Deaths'],hue = "Status")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('Under Five Deaths', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs Under Five Deaths",  fontsize = 20,fontweight='bold')
Out[62]:
Text(0.5, 1.0, 'Life Expectancy vs Under Five Deaths')
In [63]:
lifeExpecDevelping['Life_Expectancy'].corr(lifeExpecDevelping['Under_Five_Deaths'])
Out[63]:
-0.18168016593909464
In [64]:
lifeExpecDevelped['Life_Expectancy'].corr(lifeExpecDevelped['Under_Five_Deaths'])
Out[64]:
-0.046446267100599514

There is very low relationship with life expectancy and under five death. For developed countries, there is no correlation between life expectancy and under five death.

In [65]:
ax = plt.axes()
sns.scatterplot(data = lifeExpecDevelping, y = lifeExpecDevelping['Life_Expectancy'], x = lifeExpecDevelping['Under_Five_Deaths'],hue = "Country")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('Under Five Death', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs Under Five Death",  fontsize = 20,fontweight='bold')

plt.legend(loc='upper left', bbox_to_anchor=(-0.2, -0.06),fancybox=True, shadow=True, ncol=5)
Out[65]:
<matplotlib.legend.Legend at 0x7f44ca36b7f0>

Only for a few counties, life expectancy and under five death have higher correlations. For these countries, lower the under five death could improve the life expectancy. We can do further analysis on these countries.

Immunization Factors

In [66]:
lifeExpec['Life_Expectancy'].corr(lifeExpec['HepatitisB'])
Out[66]:
0.28371010671292163
In [67]:
ax = plt.axes()
sns.scatterplot(data = lifeExpec, y = lifeExpec['Life_Expectancy'], x = lifeExpec['HepatitisB'],hue = "Status")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('HepatitisB', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs HepatitisB",  fontsize = 20,fontweight='bold')
Out[67]:
Text(0.5, 1.0, 'Life Expectancy vs HepatitisB')
In [68]:
lifeExpecDevelped['Life_Expectancy'].corr(lifeExpecDevelped['HepatitisB'])
Out[68]:
-0.1853476684251502
In [69]:
lifeExpecDevelping['Life_Expectancy'].corr(lifeExpecDevelping['HepatitisB'])
Out[69]:
0.3086335488127775
In [70]:
lifeExpec['Life_Expectancy'].corr(lifeExpec['Polio'])
Out[70]:
0.4478834851578934
In [71]:
ax = plt.axes()
sns.scatterplot(data = lifeExpec, y = lifeExpec['Life_Expectancy'], x = lifeExpec['Polio'],hue = "Status")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('Polio', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs Polio",  fontsize = 20,fontweight='bold')
Out[71]:
Text(0.5, 1.0, 'Life Expectancy vs Polio')
In [72]:
lifeExpecDevelped['Life_Expectancy'].corr(lifeExpecDevelped['Polio'])
Out[72]:
0.03109132846878883
In [73]:
lifeExpecDevelping['Life_Expectancy'].corr(lifeExpecDevelping['Polio'])
Out[73]:
0.4190747242357623
In [74]:
ax = plt.axes()
sns.scatterplot(data = lifeExpecDevelping, y = lifeExpecDevelping['Life_Expectancy'], x = lifeExpecDevelping['Polio'],hue = "Country")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('Polio', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs Polio",  fontsize = 20,fontweight='bold')

plt.legend(loc='upper left', bbox_to_anchor=(-0.2, -0.06),fancybox=True, shadow=True, ncol=5)
Out[74]:
<matplotlib.legend.Legend at 0x7f44ca086430>
In [75]:
lifeExpec['Life_Expectancy'].corr(lifeExpec['Diphtheria'])
Out[75]:
0.46693551383309634
In [76]:
ax = plt.axes()
sns.scatterplot(data = lifeExpec, y = lifeExpec['Life_Expectancy'], x = lifeExpec['Diphtheria'],hue = "Status")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('Diphtheria', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs Diphtheria",  fontsize = 20,fontweight='bold')
Out[76]:
Text(0.5, 1.0, 'Life Expectancy vs Diphtheria')
In [77]:
lifeExpecDevelped['Life_Expectancy'].corr(lifeExpecDevelped['Diphtheria'])
Out[77]:
-0.007436451781031039
In [78]:
lifeExpecDevelping['Life_Expectancy'].corr(lifeExpecDevelping['Diphtheria'])
Out[78]:
0.45340852569448836
In [79]:
ax = plt.axes()
sns.scatterplot(data = lifeExpecDevelping, y = lifeExpecDevelping['Life_Expectancy'], x = lifeExpecDevelping['Diphtheria'],hue = "Country")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('Polio', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs Diphtheria",  fontsize = 20,fontweight='bold')

plt.legend(loc='upper left', bbox_to_anchor=(-0.2, -0.06),fancybox=True, shadow=True, ncol=5)
Out[79]:
<matplotlib.legend.Legend at 0x7f44c9c6d3a0>
In [80]:
lifeExpec['Life_Expectancy'].corr(lifeExpec['HIV/AIDS'])
Out[80]:
-0.5890648695404345
In [81]:
lifeExpecDevelped['Life_Expectancy'].corr(lifeExpecDevelped['HIV/AIDS'])
Out[81]:
2.001504491858226e-15
In [82]:
lifeExpecDevelping['Life_Expectancy'].corr(lifeExpecDevelping['HIV/AIDS'])
Out[82]:
-0.6031286231500611
In [ ]:
 

Economic Factors

In [83]:
lifeExpec['Life_Expectancy'].corr(lifeExpec['Percentage_Exp'])
Out[83]:
0.4046144562099111
In [84]:
ax = plt.axes()
sns.scatterplot(data = lifeExpec, y = lifeExpec['Life_Expectancy'], x = lifeExpec['Percentage_Exp'],hue = "Status")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('Percentage Expenditure', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs Percentage Expenditure",  fontsize = 15,fontweight='bold')
Out[84]:
Text(0.5, 1.0, 'Life Expectancy vs Percentage Expenditure')
In [85]:
lifeExpecDevelped['Life_Expectancy'].corr(lifeExpecDevelped['Percentage_Exp'])
Out[85]:
0.40277100295503654
In [86]:
lifeExpec['Life_Expectancy'].corr(lifeExpec['Tot_Exp'])
Out[86]:
0.16856648640209512
In [87]:
ax = plt.axes()
sns.scatterplot(data = lifeExpec, y = lifeExpec['Life_Expectancy'], x = lifeExpec['Tot_Exp'],hue = "Status")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('Total Expediture', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs Total Expenditure",  fontsize = 15,fontweight='bold')
Out[87]:
Text(0.5, 1.0, 'Life Expectancy vs Total Expenditure')
In [88]:
lifeExpecDevelped['Life_Expectancy'].corr(lifeExpecDevelped['Tot_Exp'])
Out[88]:
0.15034523637417097
In [89]:
lifeExpecDevelping['Life_Expectancy'].corr(lifeExpecDevelping['Tot_Exp'])
Out[89]:
0.078351517530003
In [90]:
ax = plt.axes()
sns.scatterplot(data = lifeExpecDevelped, y = lifeExpecDevelped['Life_Expectancy'], x = lifeExpecDevelped['Tot_Exp'],hue = "Country")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('Total Expediture', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs Total Expediture in Developed Countries",  fontsize = 20,fontweight='bold')
plt.legend(loc='upper left', bbox_to_anchor=(-0.2, -0.06),fancybox=True, shadow=True, ncol=5)
Out[90]:
<matplotlib.legend.Legend at 0x7f44c81ae4c0>
In [91]:
lifeExpec['Life_Expectancy'].corr(lifeExpec['GDP'])
Out[91]:
0.5604577174382323
In [92]:
ax = plt.axes()
sns.scatterplot(data = lifeExpec, y = lifeExpec['Life_Expectancy'], x = lifeExpec['GDP'],hue = "Status")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('GDP', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs GDP",  fontsize = 20,fontweight='bold')
Out[92]:
Text(0.5, 1.0, 'Life Expectancy vs GDP')
In [93]:
lifeExpecDevelped['Life_Expectancy'].corr(lifeExpecDevelped['GDP'])
Out[93]:
0.5902593874691077
In [94]:
lifeExpecDevelping['Life_Expectancy'].corr(lifeExpecDevelping['GDP'])
Out[94]:
0.46697167697892705
In [95]:
ax = plt.axes()
sns.scatterplot(data = lifeExpecDevelped, y = lifeExpecDevelped['Life_Expectancy'], x = lifeExpecDevelped['GDP'],hue = "Country")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('GDP', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs GDP in Developed Countries",  fontsize = 20,fontweight='bold')
plt.legend(loc='upper left', bbox_to_anchor=(-0.2, -0.06),fancybox=True, shadow=True, ncol=5)
Out[95]:
<matplotlib.legend.Legend at 0x7f44bafbe2e0>
In [96]:
ax = plt.axes()
sns.scatterplot(data = lifeExpecDevelping, y = lifeExpecDevelping['Life_Expectancy'], x= lifeExpecDevelping['GDP'],hue = "Country")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('GDP', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs GDP in Developing Countries",  fontsize = 20,fontweight='bold')
plt.legend(loc='upper left', bbox_to_anchor=(-0.2, -0.06),fancybox=True, shadow=True, ncol=5)
Out[96]:
<matplotlib.legend.Legend at 0x7f44baf983a0>
In [97]:
lifeExpec['Life_Expectancy'].corr(lifeExpec['Income_Comp_Of_Resources'])
Out[97]:
0.6936189219089646
In [98]:
ax = plt.axes()
sns.scatterplot(data = lifeExpec, y= lifeExpec['Life_Expectancy'],x = lifeExpec['Income_Comp_Of_Resources'],hue = "Status")
plt.ylabel('Life Expectancy', fontsize = 10,fontweight='bold')
plt.xlabel('Income Composition of Resources', fontsize = 10,fontweight='bold')
ax.set_title("Life Expectancy vs Income Composition of Resources",  fontsize = 12,fontweight='bold')
Out[98]:
Text(0.5, 1.0, 'Life Expectancy vs Income Composition of Resources')
In [99]:
ax = plt.axes()
sns.scatterplot(data = lifeExpecDevelped, y = lifeExpecDevelped['Life_Expectancy'], x = lifeExpecDevelped['Income_Comp_Of_Resources'],hue = "Country")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('Income Composition of Resources', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs Income Composition of Resources in Developed Countries",  fontsize = 12,fontweight='bold')
plt.legend(loc='upper left', bbox_to_anchor=(-0.2, -0.06),fancybox=True, shadow=True, ncol=5)
Out[99]:
<matplotlib.legend.Legend at 0x7f44baad73d0>
In [100]:
ax = plt.axes()
sns.scatterplot(data = lifeExpecDevelping, y = lifeExpecDevelping['Life_Expectancy'], x = lifeExpecDevelping['Income_Comp_Of_Resources'],hue = "Country")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('Income Composition of Resources', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs Income Composition of Resources in Developed Countries",  fontsize = 12,fontweight='bold')
plt.legend(loc='upper left', bbox_to_anchor=(-0.2, -0.06),fancybox=True, shadow=True, ncol=5)
Out[100]:
<matplotlib.legend.Legend at 0x7f44baa232b0>
In [101]:
lifeExpecDevelped['Life_Expectancy'].corr(lifeExpecDevelped['Income_Comp_Of_Resources'])
Out[101]:
0.7240377050245497
In [102]:
lifeExpecDevelping['Life_Expectancy'].corr(lifeExpecDevelping['Income_Comp_Of_Resources'])
Out[102]:
0.6240240378189589

Social factors

  • schooling
  • Population
In [103]:
lifeExpec['Life_Expectancy'].corr(lifeExpec['Schooling'])
Out[103]:
0.7336048712147273
In [104]:
ax = plt.axes()
sns.scatterplot(data = lifeExpec, y= lifeExpec['Life_Expectancy'],x = lifeExpec['Schooling'],hue = "Status")
plt.ylabel('Life Expectancy', fontsize = 10,fontweight='bold')
plt.xlabel('Schooling', fontsize = 10,fontweight='bold')
ax.set_title("Life Expectancy vs Schooling",  fontsize = 15,fontweight='bold')
Out[104]:
Text(0.5, 1.0, 'Life Expectancy vs Schooling')
In [105]:
lifeExpecDevelped['Life_Expectancy'].corr(lifeExpecDevelped['Schooling'])
Out[105]:
0.3826873649636899
In [106]:
lifeExpecDevelping['Life_Expectancy'].corr(lifeExpecDevelping['Schooling'])
Out[106]:
0.6808867178385944
In [107]:
ax = plt.axes()
sns.scatterplot(data = lifeExpecDevelping, y = lifeExpecDevelping['Life_Expectancy'], x = lifeExpecDevelping['Schooling'],hue = "Country")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('Schooling', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs Schooling",  fontsize = 12,fontweight='bold')
plt.legend(loc='upper left', bbox_to_anchor=(-0.2, -0.06),fancybox=True, shadow=True, ncol=5)
Out[107]:
<matplotlib.legend.Legend at 0x7f44ba6dd490>
In [108]:
lifeExpec['Life_Expectancy'].corr(lifeExpec['Population'])
Out[108]:
0.013048459586362375
In [109]:
ax = plt.axes()
sns.scatterplot(data = lifeExpec, y= lifeExpec['Life_Expectancy'],x = lifeExpec['Population'],hue = "Status")
plt.ylabel('Life Expectancy', fontsize = 10,fontweight='bold')
plt.xlabel('Population', fontsize = 10,fontweight='bold')
ax.set_title("Life Expectancy vs Population",  fontsize = 15,fontweight='bold')
Out[109]:
Text(0.5, 1.0, 'Life Expectancy vs Population')
In [110]:
lifeExpecDevelped['Life_Expectancy'].corr(lifeExpecDevelped['Population'])
Out[110]:
0.21012123801021867
In [111]:
lifeExpecDevelping['Life_Expectancy'].corr(lifeExpecDevelping['Population'])
Out[111]:
0.04027574734935439
  • Alcohol
  • BMI
  • thinness_1to19_years
  • thinness_5to9_years
  • HIV/AIDS
  • Measles
In [112]:
lifeExpec['Life_Expectancy'].corr(lifeExpec['Measles'])
Out[112]:
-0.1512301020702818
In [113]:
lifeExpecDevelped['Life_Expectancy'].corr(lifeExpecDevelped['Measles'])
Out[113]:
-0.05078699602836095
In [114]:
lifeExpecDevelping['Life_Expectancy'].corr(lifeExpecDevelping['Measles'])
Out[114]:
-0.13596003298456444
In [115]:
ax = plt.axes()
sns.scatterplot(data = lifeExpec, y = lifeExpec['Life_Expectancy'], x = lifeExpec['Alcohol'],hue = "Status")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('Alcohol', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs Alcohol",  fontsize = 20,fontweight='bold')
Out[115]:
Text(0.5, 1.0, 'Life Expectancy vs Alcohol')
In [116]:
lifeExpecDevelped['Life_Expectancy'].corr(lifeExpecDevelped['Alcohol'])
Out[116]:
-0.20407095274156897
In [117]:
lifeExpecDevelping['Life_Expectancy'].corr(lifeExpecDevelping['Alcohol'])
Out[117]:
0.17239461539516923
In [118]:
ax = plt.axes()
sns.scatterplot(data = lifeExpecDevelped, y = lifeExpecDevelped['Life_Expectancy'], x = lifeExpecDevelped['Schooling'],hue = "Country")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('Schooling', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs Schooling in Developed Countries",  fontsize = 12,fontweight='bold')
plt.legend(loc='upper left', bbox_to_anchor=(-0.2, -0.06),fancybox=True, shadow=True, ncol=5)
Out[118]:
<matplotlib.legend.Legend at 0x7f44ba21f760>
In [119]:
lifeExpec['Life_Expectancy'].corr(lifeExpec['BMI'])
Out[119]:
0.5722808697013191
In [120]:
ax = plt.axes()
sns.scatterplot(data = lifeExpec, y = lifeExpec['Life_Expectancy'], x = lifeExpec['BMI'],hue = "Status")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('BMI', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs BMI",  fontsize = 20,fontweight='bold')
Out[120]:
Text(0.5, 1.0, 'Life Expectancy vs BMI')
In [121]:
lifeExpecDevelping['Life_Expectancy'].corr(lifeExpecDevelping['BMI'])
Out[121]:
0.5626730056861268
In [122]:
lifeExpecDevelped['Life_Expectancy'].corr(lifeExpecDevelped['BMI'])
Out[122]:
-0.006094497369848113
In [123]:
ax = plt.axes()
sns.scatterplot(data = lifeExpecDevelping, y = lifeExpecDevelping['Life_Expectancy'], x = lifeExpecDevelping['BMI'],hue = "Country")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('BMI', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs BMI in Developing Countries",  fontsize = 12,fontweight='bold')
plt.legend(loc='upper left', bbox_to_anchor=(-0.2, -0.06),fancybox=True, shadow=True, ncol=5)
Out[123]:
<matplotlib.legend.Legend at 0x7f44ba0d0dc0>
In [124]:
lifeExpec['Life_Expectancy'].corr(lifeExpec['thinness_10to19_years'])
Out[124]:
-0.442014798070637
In [125]:
ax = plt.axes()
sns.scatterplot(data = lifeExpec, y = lifeExpec['Life_Expectancy'], x = lifeExpec['thinness_10to19_years'],hue = "Status")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('thinness 10 to 19 years', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs thinness 10 to 19 years",  fontsize = 15,fontweight='bold')
Out[125]:
Text(0.5, 1.0, 'Life Expectancy vs thinness 10 to 19 years')
In [126]:
lifeExpecDevelped['Life_Expectancy'].corr(lifeExpecDevelped['thinness_10to19_years'])
Out[126]:
-0.6152969464491912
In [127]:
lifeExpecDevelping['Life_Expectancy'].corr(lifeExpecDevelping['thinness_10to19_years'])
Out[127]:
-0.35332547973205
In [128]:
ax = plt.axes()
sns.scatterplot(data = lifeExpecDevelped, y = lifeExpecDevelped['Life_Expectancy'], x = lifeExpecDevelped['Schooling'],hue = "Country")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('Schooling', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs Schooling in Developed Countries",  fontsize = 12,fontweight='bold')
plt.legend(loc='upper left', bbox_to_anchor=(-0.2, -0.06),fancybox=True, shadow=True, ncol=5)
Out[128]:
<matplotlib.legend.Legend at 0x7f44b9c59040>
In [129]:
lifeExpec['Life_Expectancy'].corr(lifeExpec['thinness_5to9_years'])
Out[129]:
-0.43521407962159886
In [130]:
ax = plt.axes()
sns.scatterplot(data = lifeExpec, y = lifeExpec['Life_Expectancy'], x = lifeExpec['thinness_5to9_years'],hue = "Status")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('thinness 5 to 9 years', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs thinness 5 to 9 years",  fontsize = 15,fontweight='bold')
Out[130]:
Text(0.5, 1.0, 'Life Expectancy vs thinness 5 to 9 years')
In [131]:
lifeExpec['Life_Expectancy'].corr(lifeExpec['HIV/AIDS'])
Out[131]:
-0.5890648695404345
In [132]:
lifeExpecDevelped['Life_Expectancy'].corr(lifeExpecDevelped['HIV/AIDS'])
Out[132]:
2.001504491858226e-15
In [133]:
lifeExpecDevelping['Life_Expectancy'].corr(lifeExpecDevelping['HIV/AIDS'])
Out[133]:
-0.6031286231500611
In [134]:
# Life_Expectancy through years
plt.figure(figsize=(6,6))
plt.bar(lifeExpec.groupby('Year')['Year'].count().index,lifeExpec.groupby('Year')['Life_Expectancy'].mean(),color='cornflowerblue',alpha=0.65)
plt.xlabel("Year",fontsize=12)
plt.ylabel("Avg Life_Expectancy",fontsize=12)
plt.title("Life_Expectancy vs Year",fontweight='bold')
plt.show()
In [ ]:
 
In [135]:
# Life_Expectancy through years
plt.figure(figsize=(6,6))
plt.bar(lifeExpecDevelped.groupby('Year')['Year'].count().index,lifeExpecDevelped.groupby('Year')['Life_Expectancy'].mean(),color='cornflowerblue',alpha=0.65)
plt.xlabel("Year",fontsize=12)
plt.ylabel("Avg Life_Expectancy",fontsize=12)
plt.title("Life_Expectancy in developed countries vs Year",fontweight='bold')
plt.show()
In [136]:
# Life_Expectancy through years
plt.figure(figsize=(6,6))
plt.bar(lifeExpecDevelping.groupby('Year')['Year'].count().index,lifeExpecDevelping.groupby('Year')['Life_Expectancy'].mean(),color='cornflowerblue',alpha=0.65)
plt.xlabel("Year",fontsize=12)
plt.ylabel("Avg Life_Expectancy",fontsize=12)
plt.title("Life_Expectancy in developing countries vs Year",fontweight='bold')
plt.show()
In [137]:
## check correlation within features
lifeExpec['thinness_5to9_years'].corr(lifeExpec['thinness_10to19_years'])
Out[137]:
0.9344957183969055
In [138]:
lifeExpec['Income_Comp_Of_Resources'].corr(lifeExpec['Schooling'])
Out[138]:
0.7697843270545522
In [139]:
lifeExpec['Infant_Deaths'].corr(lifeExpec['Under_Five_Deaths'])
Out[139]:
0.5492535243315712
In [140]:
lifeExpec['GDP'].corr(lifeExpec['Percentage_Exp'])
Out[140]:
0.7118520842653635

models

  • linear regression
  • random forest
In [141]:
## fit linear regression model
response = lifeExpecDevelped["Life_Expectancy"]
predictors = lifeExpecDevelped[['Adult_Mortality','Percentage_Exp', 'GDP','Income_Comp_Of_Resources','Schooling','Population','thinness_10to19_years']]
In [142]:
model =sm.OLS(response,predictors).fit()
##print (model.params)
print (model.summary())
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:        Life_Expectancy   R-squared (uncentered):                   0.999
Model:                            OLS   Adj. R-squared (uncentered):              0.999
Method:                 Least Squares   F-statistic:                          3.360e+04
Date:                Sat, 21 Nov 2020   Prob (F-statistic):                        0.00
Time:                        19:34:01   Log-Likelihood:                         -822.38
No. Observations:                 330   AIC:                                      1659.
Df Residuals:                     323   BIC:                                      1685.
Df Model:                           7                                                  
Covariance Type:            nonrobust                                                  
============================================================================================
                               coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------
Adult_Mortality              0.0012      0.004      0.322      0.747      -0.006       0.009
Percentage_Exp           -4.223e-07   6.19e-05     -0.007      0.995      -0.000       0.000
GDP                      -8.933e-05   1.25e-05     -7.145      0.000      -0.000   -6.47e-05
Income_Comp_Of_Resources   120.6666      2.920     41.321      0.000     114.922     126.412
Schooling                   -1.2052      0.135     -8.905      0.000      -1.471      -0.939
Population                2.523e-10    8.1e-09      0.031      0.975   -1.57e-08    1.62e-08
thinness_10to19_years       -0.7654      0.258     -2.964      0.003      -1.273      -0.257
==============================================================================
Omnibus:                       57.094   Durbin-Watson:                   0.651
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               86.756
Skew:                           1.062   Prob(JB):                     1.45e-19
Kurtosis:                       4.340   Cond. No.                     4.74e+08
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.74e+08. This might indicate that there are
strong multicollinearity or other numerical problems.
In [143]:
response = lifeExpecDevelped["Life_Expectancy"]
predictors1 = lifeExpecDevelped[['Adult_Mortality', 'GDP','Population','thinness_10to19_years']]
In [144]:
model1 =sm.OLS(response,predictors1).fit()
print (model1.summary())
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:        Life_Expectancy   R-squared (uncentered):                   0.931
Model:                            OLS   Adj. R-squared (uncentered):              0.930
Method:                 Least Squares   F-statistic:                              1101.
Date:                Sat, 21 Nov 2020   Prob (F-statistic):                   6.41e-188
Time:                        19:34:01   Log-Likelihood:                         -1468.7
No. Observations:                 330   AIC:                                      2945.
Df Residuals:                     326   BIC:                                      2961.
Df Model:                           4                                                  
Covariance Type:            nonrobust                                                  
=========================================================================================
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Adult_Mortality           0.1878      0.025      7.627      0.000       0.139       0.236
GDP                       0.0009   3.87e-05     24.447      0.000       0.001       0.001
Population             3.738e-07   5.28e-08      7.079      0.000     2.7e-07    4.78e-07
thinness_10to19_years    15.2630      1.445     10.561      0.000      12.420      18.106
==============================================================================
Omnibus:                        8.570   Durbin-Watson:                   0.371
Prob(Omnibus):                  0.014   Jarque-Bera (JB):                8.663
Skew:                          -0.396   Prob(JB):                       0.0131
Kurtosis:                       3.060   Cond. No.                     3.32e+07
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.32e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
In [145]:
from sklearn.model_selection import train_test_split
trainDataDeveloped = lifeExpecDevelped.drop(["Country", "Year",'Status'],axis=1)
##XDeveloped = trainDataDeveloped.drop(['Life_Expectancy'],axis=1)
XDeveloped = lifeExpecDevelped[['Adult_Mortality','Percentage_Exp', 'GDP','Income_Comp_Of_Resources','Schooling','Population','thinness_10to19_years']]
yDeveloped = trainDataDeveloped["Life_Expectancy"]
X_train, X_test, y_train, y_test = train_test_split(XDeveloped, yDeveloped, test_size=0.30, random_state=101)
In [146]:
from sklearn.linear_model import LinearRegression
In [147]:
Linear_model= LinearRegression()
In [148]:
Linear_model.fit(X_train,y_train)
Out[148]:
LinearRegression()
In [149]:
predictions1=Linear_model.predict(X_test)
In [150]:
Linear_model.coef_
Out[150]:
array([-6.44447964e-03,  5.20756207e-05, -2.00878260e-05,  5.40249732e+01,
       -4.65965926e-01,  7.79537727e-09, -1.43523735e+00])
In [151]:
lifeExpec.columns
Out[151]:
Index(['Country', 'Year', 'Status', 'Life_Expectancy', 'Adult_Mortality',
       'Infant_Deaths', 'Alcohol', 'Percentage_Exp', 'HepatitisB', 'Measles',
       'BMI', 'Under_Five_Deaths', 'Polio', 'Tot_Exp', 'Diphtheria',
       'HIV/AIDS', 'GDP', 'Population', 'thinness_10to19_years',
       'thinness_5to9_years', 'Income_Comp_Of_Resources', 'Schooling'],
      dtype='object')
In [152]:
trainData = lifeExpec.drop(["Country", "Year",'Status'],axis=1)
X = trainData.drop(['Life_Expectancy'],axis=1)
y = trainData["Life_Expectancy"]
In [153]:
trainData2 = lifeExpec.drop(["Country", "Year",'Status','Schooling','thinness_10to19_years'],axis=1)
X = trainData2.drop(['Life_Expectancy'],axis=1)
y = trainData2["Life_Expectancy"]
In [154]:
trainData2.columns
Out[154]:
Index(['Life_Expectancy', 'Adult_Mortality', 'Infant_Deaths', 'Alcohol',
       'Percentage_Exp', 'HepatitisB', 'Measles', 'BMI', 'Under_Five_Deaths',
       'Polio', 'Tot_Exp', 'Diphtheria', 'HIV/AIDS', 'GDP', 'Population',
       'thinness_5to9_years', 'Income_Comp_Of_Resources'],
      dtype='object')
In [155]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
In [156]:
## normalize data
scaler = StandardScaler() 
data_scaled = scaler.fit_transform(trainData2 )
In [157]:
from sklearn.ensemble import RandomForestRegressor

regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
In [158]:
from sklearn import metrics

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Mean Absolute Error: 1.1509232365145219
Mean Squared Error: 3.183054927385892
Root Mean Squared Error: 1.7841118034994028
In [159]:
# Get numerical feature importances
importances = list(regressor.feature_importances_)
feature_list = list(X.columns)
# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]
# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
# Print out the feature and importances 
[print('Variable: {:20} Importance: {}'.format(*pair)) for pair in feature_importances];
Variable: HIV/AIDS             Importance: 0.53
Variable: Income_Comp_Of_Resources Importance: 0.17
Variable: GDP                  Importance: 0.12
Variable: Adult_Mortality      Importance: 0.11
Variable: BMI                  Importance: 0.02
Variable: Alcohol              Importance: 0.01
Variable: Under_Five_Deaths    Importance: 0.01
Variable: Tot_Exp              Importance: 0.01
Variable: Population           Importance: 0.01
Variable: thinness_5to9_years  Importance: 0.01
Variable: Infant_Deaths        Importance: 0.0
Variable: Percentage_Exp       Importance: 0.0
Variable: HepatitisB           Importance: 0.0
Variable: Measles              Importance: 0.0
Variable: Polio                Importance: 0.0
Variable: Diphtheria           Importance: 0.0
In [160]:
trainDataDeveloped = lifeExpecDevelped.drop(["Country", "Year",'Status'],axis=1)
XDeveloped = trainDataDeveloped.drop(['Life_Expectancy'],axis=1)
yDeveloped = trainDataDeveloped["Life_Expectancy"]
In [161]:
## normalize data
scaler = StandardScaler() 
data_scaled = scaler.fit_transform(trainDataDeveloped )
In [162]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(XDeveloped, yDeveloped, test_size=0.2, random_state=0)
In [163]:
regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
In [164]:
# Get numerical feature importances
importances = list(regressor.feature_importances_)
feature_list = list(X.columns)
# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]
# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
# Print out the feature and importances 
[print('Variable: {:20} Importance: {}'.format(*pair)) for pair in feature_importances];
Variable: GDP                  Importance: 0.63
Variable: Adult_Mortality      Importance: 0.07
Variable: HepatitisB           Importance: 0.04
Variable: Alcohol              Importance: 0.03
Variable: BMI                  Importance: 0.03
Variable: Tot_Exp              Importance: 0.02
Variable: Population           Importance: 0.02
Variable: thinness_5to9_years  Importance: 0.02
Variable: Percentage_Exp       Importance: 0.01
Variable: Measles              Importance: 0.01
Variable: Under_Five_Deaths    Importance: 0.01
Variable: Diphtheria           Importance: 0.01
Variable: Income_Comp_Of_Resources Importance: 0.01
Variable: Infant_Deaths        Importance: 0.0
Variable: Polio                Importance: 0.0
Variable: HIV/AIDS             Importance: 0.0
In [165]:
plt.figure(1)
feat_importances = pd.Series(regressor.feature_importances_, index=X_train.columns)
feat_importances.nlargest(5).plot(kind='barh')
plt.title("Feature Importance in Developed Countries", fontWeight = 'bold')
Out[165]:
Text(0.5, 1.0, 'Feature Importance in Developed Countries')
In [166]:
trainDataDeveloping = lifeExpecDevelping.drop(["Country", "Year",'Status'],axis=1)
XDeveloping = trainDataDeveloping.drop(['Life_Expectancy'],axis=1)
yDeveloping = trainDataDeveloping["Life_Expectancy"]
In [167]:
## normalize data
scaler = StandardScaler() 
data_scaled = scaler.fit_transform(trainDataDeveloping )

X_train, X_test, y_train, y_test = train_test_split(XDeveloping, yDeveloping, test_size=0.2, random_state=0)
regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

# Get numerical feature importances
importances = list(regressor.feature_importances_)
feature_list = list(X.columns)
# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]
# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
# Print out the feature and importances 
[print('Variable: {:20} Importance: {}'.format(*pair)) for pair in feature_importances];
Variable: HIV/AIDS             Importance: 0.59
Variable: Adult_Mortality      Importance: 0.13
Variable: GDP                  Importance: 0.09
Variable: BMI                  Importance: 0.02
Variable: Alcohol              Importance: 0.01
Variable: Under_Five_Deaths    Importance: 0.01
Variable: Polio                Importance: 0.01
Variable: Tot_Exp              Importance: 0.01
Variable: Population           Importance: 0.01
Variable: Income_Comp_Of_Resources Importance: 0.01
Variable: Infant_Deaths        Importance: 0.0
Variable: Percentage_Exp       Importance: 0.0
Variable: HepatitisB           Importance: 0.0
Variable: Measles              Importance: 0.0
Variable: Diphtheria           Importance: 0.0
Variable: thinness_5to9_years  Importance: 0.0
In [168]:
plt.figure(1)
feat_importances = pd.Series(regressor.feature_importances_, index=X_train.columns)
feat_importances.nlargest(5).plot(kind='barh')
plt.title("Feature Importance in Developing Countries", fontWeight = 'bold')
Out[168]:
Text(0.5, 1.0, 'Feature Importance in Developing Countries')
In [169]:
## get the lifeexpectancy is lower than 50
lifeExpectLow = lifeExpec[lifeExpec["Life_Expectancy"] < 50]
In [170]:
len(lifeExpectLow["Country"].unique())
Out[170]:
16
In [171]:
trainDataDevelopingLow = lifeExpectLow.drop(["Country", "Year",'Status'],axis=1)
XDevelopingLow = trainDataDevelopingLow.drop(['Life_Expectancy'],axis=1)
yDevelopingLow = trainDataDevelopingLow["Life_Expectancy"]
In [172]:
## normalize data
scaler = StandardScaler() 
data_scaled = scaler.fit_transform(trainDataDevelopingLow )

X_train, X_test, y_train, y_test = train_test_split(XDevelopingLow, yDevelopingLow, test_size=0.2, random_state=0)
regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

# Get numerical feature importances
importances = list(regressor.feature_importances_)
feature_list = list(X.columns)
# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]
# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
# Print out the feature and importances 
[print('Variable: {:20} Importance: {}'.format(*pair)) for pair in feature_importances];
Variable: GDP                  Importance: 0.19
Variable: Adult_Mortality      Importance: 0.16
Variable: BMI                  Importance: 0.13
Variable: Tot_Exp              Importance: 0.11
Variable: HIV/AIDS             Importance: 0.11
Variable: Under_Five_Deaths    Importance: 0.05
Variable: Population           Importance: 0.04
Variable: Alcohol              Importance: 0.02
Variable: Polio                Importance: 0.02
Variable: Diphtheria           Importance: 0.02
Variable: Income_Comp_Of_Resources Importance: 0.02
Variable: Infant_Deaths        Importance: 0.01
Variable: Percentage_Exp       Importance: 0.01
Variable: HepatitisB           Importance: 0.01
Variable: Measles              Importance: 0.01
Variable: thinness_5to9_years  Importance: 0.01
In [173]:
lifeExpectLow.describe()
Out[173]:
Year Life_Expectancy Adult_Mortality Infant_Deaths Alcohol Percentage_Exp HepatitisB Measles BMI Under_Five_Deaths Polio Tot_Exp Diphtheria HIV/AIDS GDP Population thinness_10to19_years thinness_5to9_years Income_Comp_Of_Resources Schooling
count 103.000000 103.000000 103.000000 103.000000 103.000000 103.000000 103.000000 103.000000 103.000000 103.000000 103.000000 103.000000 103.000000 103.000000 103.000000 1.030000e+02 103.000000 103.000000 103.000000 103.000000
mean 2004.213592 46.727184 413.199612 73.844660 3.462330 50.158760 56.019417 10428.543689 18.984466 119.213592 59.893204 6.045534 56.165049 13.830097 1175.215022 1.794700e+07 7.271845 7.147573 0.368893 8.192233
std 3.426112 2.363538 223.961056 136.400776 2.777351 72.987822 27.923909 32408.354300 6.830364 220.934364 26.926885 2.719579 28.698602 12.305888 1238.344371 3.165332e+07 3.766166 3.788265 0.113717 2.104230
min 2000.000000 36.300000 4.000000 2.000000 0.010000 0.000000 6.000000 0.000000 2.200000 4.000000 3.000000 1.120000 4.000000 0.600000 272.991178 1.728340e+06 1.000000 1.000000 0.000000 3.900000
25% 2001.000000 45.450000 367.000000 17.000000 1.500000 8.901421 26.000000 34.000000 15.750000 26.000000 42.000000 4.135000 33.500000 2.600000 417.349084 4.426378e+06 4.050000 4.000000 0.318500 6.250000
50% 2004.000000 47.100000 463.000000 30.000000 2.690000 33.346915 64.000000 649.000000 17.800000 48.000000 66.000000 5.700000 64.000000 9.000000 805.292339 1.006701e+07 8.000000 7.800000 0.387000 8.500000
75% 2006.500000 48.500000 586.500000 49.500000 4.395000 53.210573 82.000000 2739.500000 22.450000 82.000000 84.000000 7.145000 83.000000 23.550000 1230.780371 1.349387e+07 9.600000 9.550000 0.437000 9.900000
max 2014.000000 49.900000 723.000000 576.000000 10.570000 469.582390 99.000000 212183.000000 44.200000 943.000000 99.000000 13.630000 99.000000 43.500000 5542.305427 1.426141e+08 14.300000 14.400000 0.580000 11.900000
In [174]:
plt.figure(1)
feat_importances = pd.Series(regressor.feature_importances_, index=X_train.columns)
feat_importances.nlargest(5).plot(kind='barh')
plt.title("Feature Importance in low Life Expetancy Countries", fontWeight = 'bold')
Out[174]:
Text(0.5, 1.0, 'Feature Importance in low Life Expetancy Countries')
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]: